ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Full Citation in the ACM Digital Library

SESSION: Invited Presentations

Programming and execution models for next generation code intelligence systems (keynote)

McCarthy defined Artificial Intelligence (AI) as “the science and engineering of making intelligent computer programs”. I define Code Intelligence (CI) by specializing his definition to intelligent computer programs that analyze other computer programs (PAPs for short) and use it to structure the keynote in two parts. The first part argues for more research on engineering foundations of PAPs. The second part outlooks new directions of combining rule-based intelligence of current PAPs with learning capabilities. Together, novel engineering foundations and enhanced intelligence characterize what I think will be the next generation of PAPs.

The 4ps: product, process, people, and productivity: a data-driven approach to improve software engineering (keynote)

In this talk I will provide a broad overview on developer productivity and dive deep into specific analysis related to how product, process and the people impact productivity. I will use examples from industry on effort estimation and defect prediction in product, distributed development in process and the ramp up of new employees in the people category. The talk will also cover interventions via tools and process changes and their impact and discuss future challenges. This talk will be based on previously published work.

Interactive analysis of large code bases (invited talk)

Current static source code analyzers can be slow, hard to use correctly, and expensive. If not properly configured, they can also generate large amounts of output, even for well-written code. To fix this, we developed a new tool called Cobra. The Cobra tool can be used interactively even on very large code bases, which means that it is very fast. It is also designed to be easy to use and free.

The tool comes with a library of predefined queries to catch standard coding issues, including cyber-security related risks from the CVE database. The query libraries can be used as provided, but the best part is that you can also easily modify and extend those queries, or add your own. The Cobra tool is language neutral: you can use it to check C or C++ code, or Java, Ada, Python, or even English prose.

I'll show how the tool works, what makes it fast, and how you can write powerful queries in a couple of different ways.

Managers hate uncertainty: good and bad experiences with adaptive project management (invited talk)

There is intrinsic uncertainty in product development. Uncertainty is often confused with risk, where unknowns can be identified in advance. Traditional predictive project management methods are based on risk management and it will be explained why these methods often are less effective in domains with higher uncertainty. Adaptive methods seem to be more effective, as they better deal with that uncertainty. Examples will be given where adaptive methods have led to higher predictability. But many managers still often believe predictive methods lead to higher predictability. We will formulate some hypotheses on why managers still prefer predictive methods, based on own experience and initial research. Further research is required to validate these hypotheses.

Industrial best practices for continuous integration (CI) and continuously delivery (CD) (invited talk)

Industrial best practices for Continuous Integration (CI) and Continuously Delivery (CD) are constantly evolving, and the state of the art is advancing quickly across the industry. In this talk I will discuss the best practices that have been implemented by many top software companies across the industry including Google, Netflix, and VMWare where I have worked. I will discuss how these systems are being optimized to reduce human and machine resources required to qualify software releases and reduce the risk of latent defects.

Huawei’s practices on trusted software engineering capability improvement (invited talk)

Human society is rapidly transforming to the digital society, the new technologies drive digital and intelligent transformation in all industries. These technologies promise cost savings, efficiency gains, and new value. At the same time, we see growing challenges relating to cyber security and privacy protection and functional safety etc. As a leading ICT product and service provider, Huawei has been committed to providing customers with high-quality and user-friendly products and services. Two years ago, Huawei initiated the Transformation Program for Software Engineering Capability Enhancement, to improve company-wide software engineering capabilities, improve the trustworthiness of both processes and results, and to provide trustworthy and quality products. This talk will share the practices, progress, and challenges of the transformation.

SESSION: Cyber-Physical Systems

Hazard analysis for human-on-the-loop interactions in sUAS systems

With the rise of new AI technologies, autonomous systems are moving towards a paradigm in which increasing levels of responsibility are shifted from the human to the system, creating a transition from human-in-the-loop systems to human-on-the-loop (HoTL) systems. This has a significant impact on the safety analysis of such systems, as new types of errors occurring at the boundaries of human-machine interactions need to be taken into consideration. Traditional safety analysis typically focuses on system-level hazards with little focus on user-related or user-induced hazards that can cause critical system failures. To address this issue, we construct domain-level safety analysis assets for sUAS (small unmanned aerial systems) applications and describe the process we followed to explicitly, and systematically identify Human Interaction Points (HiPs), Hazard Factors and Mitigations from system hazards. We evaluate our approach by first investigating the extent to which recent sUAS incidents are covered by our hazard trees, and second by performing a study with six domain experts using our hazard trees to identify and document hazards for sUAS usage scenarios. Our study showed that our hazard trees provided effective coverage for a wide variety of sUAS application scenarios and were useful for stimulating safety thinking and helping users to identify and potentially mitigate human-interaction hazards.

An exploratory study of autopilot software bugs in unmanned aerial vehicles

Unmanned aerial vehicles (UAVs) are becoming increasingly important and widely used in modern society. Software bugs in these systems can cause severe issues, such as system crashes, hangs, and undefined behaviors. Some bugs can also be exploited by hackers to launch security attacks, resulting in catastrophic consequences. Therefore, techniques that can help detect and fix software bugs in UAVs are highly desirable. However, although there are many existing studies on bugs in various types of software, the characteristics of UAV software bugs have never been systematically studied. This impedes the development of tools for assuring the dependability of UAVs. To bridge this gap, we conducted the first large-scale empirical study on two well-known open-source autopilot software platforms for UAVs, namely PX4 and Ardupilot, to characterize bugs in UAVs. Through analyzing 569 bugs from these two projects, we observed eight types of UAV-specific bugs (i.e., limit, math, inconsistency, priority, parameter, hardware support, correction, and initialization) and learned their root causes. Based on the bug taxonomy, we summarized common bug patterns and repairing strategies. We further identified five challenges associated with detecting and fixing such UAV-specific bugs. Our study can help researchers and practitioners to better understand the threats to the dependability of UAV systems and facilitate the future development of UAV bug diagnosis tools.

Code integrity attestation for PLCs using black box neural network predictions

Cyber-physical systems (CPSs) are widespread in critical domains, and significant damage can be caused if an attacker is able to modify the code of their programmable logic controllers (PLCs). Unfortunately, traditional techniques for attesting code integrity (i.e. verifying that it has not been modified) rely on firmware access or roots-of-trust, neither of which proprietary or legacy PLCs are likely to provide. In this paper, we propose a practical code integrity checking solution based on privacy-preserving black box models that instead attest the input/output behaviour of PLC programs. Using faithful offline copies of the PLC programs, we identify their most important inputs through an information flow analysis, execute them on multiple combinations to collect data, then train neural networks able to predict PLC outputs (i.e. actuator commands) from their inputs. By exploiting the black box nature of the model, our solution maintains the privacy of the original PLC code and does not assume that attackers are unaware of its presence. The trust instead comes from the fact that it is extremely hard to attack the PLC code and neural networks at the same time and with consistent outcomes. We evaluated our approach on a modern six-stage water treatment plant testbed, finding that it could predict actuator states from PLC inputs with near-100% accuracy, and thus could detect all 120 effective code mutations that we subjected the PLCs to. Finally, we found that it is not practically possible to simultaneously modify the PLC code and apply discreet adversarial noise to our attesters in a way that leads to consistent (mis-)predictions.

PHYSFRAME: type checking physical frames of reference for robotic systems

A robotic system continuously measures its own motions and the external world during operation. Such measurements are with respect to some frame of reference, i.e., a coordinate system. A nontrivial robotic system has a large number of different frames and data have to be translated back-and-forth from a frame to another. The onus is on the developers to get such translation right. However, this is very challenging and error-prone, evidenced by the large number of questions and issues related to frame uses on developers' forum. Since any state variable can be associated with some frame, reference frames can be naturally modeled as variable types. We hence develop a novel type system that can automatically infer variables' frame types and in turn detect any type inconsistencies and violations of frame conventions. The evaluation on a set of 180 publicly available ROS projects shows that our system can detect 190 inconsistencies with 154 true positives. We reported 52 to developers and received 18 responses so far, with 15 fixed/acknowledged. Our technique also finds 45 violations of common practices.

SESSION: Continuous Integration and Delivery

Automating serverless deployments for DevOps organizations

DevOps unifies software development and operations in cross-functional teams to improve software delivery and operations (SDO) performance. Ideally, cross-functional DevOps teams independently deploy their services, but the correct operation of a service often demands other services, requiring coordination to ensure the correct deployment order. This issue is currently solved either with a central deployment or manual out-of-band communication across teams, e.g., via phone, chat, or email. Unfortunately, both contradict the independence of teams, hindering SDO performance—the reason why DevOps is adopted in the first place.

In this work, we conduct a study on 73 IT professionals, showing that, in practice, they resort to manual coordination for correct deployments even if they expect better SDO performance with fully automated approaches. To address this issue, we propose µs ([mju:z] “muse”), a novel IaC system automating deployment coordination in a fully decentralized fashion, still retaining compatibility with DevOps practice—in contrast to today’s solutions. We implement µs, demonstrate that it effectively enables automated coordination, introduces negligible definition overhead, has no performance overhead, and is broadly applicable, as shown by the migration of 64 third-party IaC projects.

SESSION: Mobile Analysis and Testing

Algebraic-datatype taint tracking, with applications to understanding Android identifier leaks

Current taint analyses track flow from sources to sinks, and report the results simply as source → sink pairs, or flows. This is imprecise and ineffective in many real-world scenarios; examples include taint sources that are mutually exclusive, or flows that combine sources (e.g., IMEI and MAC Address are concatenated, hashed, leaked vs. IMEI and MAC Address hashed separately and leaked separately). These shortcomings are particularly acute in the context of Android, where sensitive identifiers can be combined, processed, and then leaked, in complicated ways. To address these issues, we introduce a novel, algebraic-datatype taint analysis that generates rich yet concise taint signatures involving AND, XOR, hashing – akin to algebraic, product and sum, types. We implemented our approach as a static analysis for Android that derives app leak signatures – an algebraic representation of how, and where, hardware/software identifiers are manipulated before being exfiltrated to the network. We perform six empirical studies of algebraic-datatype taint tracking on 1,000 top apps from Google Play and their embedded libraries, including: discerning between “raw” and hashed flows which eliminates a source of imprecision in current analyses; finding apps and libraries that go against Google Play’s guidelines by (ab)using hardware identifiers; showing that third-party code, rather than app code, is the predominant source of leaks; exposing potential de-anonymization practices; and quantifying how apps have become more privacy-friendly over the past two years.

Vet: identifying and avoiding UI exploration tarpits

Despite over a decade of research, it is still challenging for mobile UI testing tools to achieve satisfactory effectiveness, especially on industrial apps with rich features and large code bases. Our experiences suggest that existing mobile UI testing tools are prone to exploration tarpits, where the tools get stuck with a small fraction of app functionalities for an extensive amount of time. For example, a tool logs out an app at early stages without being able to log back in, and since then the tool gets stuck with exploring the app’s pre-login functionalities (i.e., exploration tarpits) instead of its main functionalities. While tool vendors/users can manually hardcode rules for the tools to avoid specific exploration tarpits, these rules can hardly generalize, being fragile in face of diverted testing environments, fast app iterations, and the demand of batch testing product lines. To identify and resolve exploration tarpits, we propose VET, a general approach including a supporting system for the given specific Android UI testing tool on the given specific app under test (AUT). VET runs the tool on the AUT for some time and records UI traces, based on which VET identifies exploration tarpits by recognizing their patterns in the UI traces. VET then pinpoints the actions (e.g., clicking logout) or the screens that lead to or exhibit exploration tarpits. In subsequent test runs, VET guides the testing tool to prevent or recover from exploration tarpits. From our evaluation with state-of-the-art Android UI testing tools on popular industrial apps, VET identifies exploration tarpits that cost up to 98.6% testing time budget. These exploration tarpits reveal not only limitations in UI exploration strategies but also defects in tool implementations. VET automatically addresses the identified exploration tarpits, enabling each evaluated tool to achieve higher code coverage and improve crash-triggering capabilities.

Checking conformance of applications against GUI policies

A good graphical user interface (GUI) is crucial for an application's usability, so vendors and regulatory agencies increasingly place restrictions on how GUI elements should appear to and interact with users. Motivated by this concern, this paper presents a new technique (based on static analysis) for checking conformance between (Android) applications and GUI policies expressed in a formal specification language. In particular, this paper (1) describes a specification language for formalizing GUI policies, (2) proposes a new program abstraction called an _event-driven layout forest_, and (3) describes a static analysis for constructing this abstraction and checking it against a GUI policy. We have implemented the proposed approach in a tool called Venus, and we evaluate it on 2361 Android applications and 17 policies. Our evaluation shows that Venus can uncover malicious applications that perform ad fraud and identify violations of GUI design guidelines and GDPR laws.

SESSION: Mobile Human-Computer Interaction

Data-driven accessibility repair revisited: on the effectiveness of generating labels for icons in Android apps

Mobile apps are playing an increasingly important role in our daily lives, including the lives of approximately 304 million users worldwide that are either completely blind or suffer from some form of visual impairment. These users rely on screen readers to interact with apps. Screen readers, however, cannot describe the image icons that appear on the screen, unless those icons are accompanied with developer-provided textual labels. A prior study of over 5,000 Android apps found that in around 50% of the apps, less than 10% of the icons are labeled. To address this problem, a recent award-winning approach, called LabelDroid, employed deep-learning techniques to train a model on a dataset of existing icons with labels to automatically generate labels for visually similar, unlabeled icons. In this work, we empirically study the nature of icon labels in terms of distribution and their dependency on different sources of information. We then assess the effectiveness of LabelDroid in predicting labels for unlabeled icons. We find that icon images are insufficient in representing icon labels, while other sources of information from the icon usage context can enrich images in determining proper tokens for labels. We propose the first context-aware label generation approach, called COALA, that incorporates several sources of information from the icon in generating accurate labels. Our experiments show that although COALA significantly outperforms LabelDroid in both user study and automatic evaluation, further research is needed. We suggest that future studies should be more cautious when basing their approach on automatically extracted labeled data.

Benchmarking automated GUI testing for Android against real-world bugs

For ensuring the reliability of Android apps, there has been tremendous, continuous progress on improving automated GUI testing in the past decade. Specifically, dozens of testing techniques and tools have been developed and demonstrated to be effective in detecting crash bugs and outperform their respective prior work in the number of detected crashes. However, an overarching question "How effectively and thoroughly can these tools find crash bugs in practice?" has not been well-explored, which requires a ground-truth benchmark with real-world bugs. Since prior studies focus on tool comparisons w.r.t. some selected apps, they cannot provide direct, in-depth answers to this question.

To complement existing work and tackle the above question, this paper offers the first ground-truth empirical evaluation of automated GUI testing for Android. To this end, we devote substantial manual effort to set up the Themis benchmark set, including (1) a carefully constructed dataset with 52 real, reproducible crash bugs (taking two person-months for its collection and validation), and (2) a unified, extensible infrastructure with six recent state-of-the-art testing tools. The whole evaluation has taken over 10,920 CPU hours. We find a considerable gap in these tools finding the collected real bugs --- 18 bugs cannot be detected by any tool. Our systematic analysis further identifies five major common challenges that these tools face, and reveals additional findings such as factors affecting these tools in bug finding and opportunities for tool improvements. Overall, this work offers new concrete insights, most of which are previously unknown/unstated and difficult to obtain. Our study presents a new, complementary perspective from prior studies to understand and analyze the effectiveness of existing testing tools, as well as a benchmark for future research on this topic. The Themis benchmark is publicly available at https://github.com/the-themis-benchmarks/home.

SESSION: Model Checking

Checking LTL[F,G,X] on compressed traces in polynomial time

The problem of checking if a program execution meets a formal specification arises in many software engineering tasks including runtime verification and designing test oracles. When online analysis is not possible, execution trace logs are stored for offline postmortem analysis, often in a compressed format to reduce disk space and warehousing requirements. A straightforward method for checking if a compressed execution satisfies a property is to first decompress it and then analyze the resulting uncompressed execution.

In this paper, we consider the problem of checking if an execution trace, compressed using a grammar-based lossless compression scheme, satisfies a specification expressed in linear temporal logic, without explicitly decompressing it. In general, this problem is known to be intractable (PSPACE-hard in the size of the compressed trace and the LTL formula). We show that the problem can be solved in polynomial time for the fragment LTL[F,G,X], which comprises of all Boolean and modal operators of LTL except the until operator. Our algorithm for analyzing SLPs (a grammar-based compression scheme) is effective in practice — for a suite of large execution traces obtained from open source projects, our algorithm shows significant speed ups when compared with the performance of checking LTL properties over corresponding uncompressed traces.

Conditional interpolation: making concurrent program verification more effective

Due to the state-space explosion problem, efficient verification of real-world programs in large scale is still a big challenge. Particularly, thread alternation makes the verification of concurrent programs much more difficult since it aggravates this problem. In this paper, an application of Craig interpolation, namely conditional interpolation, is proposed to work together with CEGAR-based approach to reduce the state-space of concurrent tasks. Specifically, conditional interpolation is formalized to confine the reachable region of states so that infeasible conditional branches could be pruned. Furthermore, the generated conditional interpolants are utilized to shorten the interpolation paths, which makes the time consumed for verification significantly reduced. We have implemented the proposed approach on top of an open-source software model checker. Empirical results show that the conditional interpolation is effective in improving the verification efficiency of concurrent tasks.

SESSION: Model-Driven Software Engineering

AlloyMax: bringing maximum satisfaction to relational specifications

Alloy is a declarative modeling language based on a first-order relational logic. Its constraint-based analysis has enabled a wide range of applications in software engineering, including configuration synthesis, bug finding, test-case generation, and security analysis. Certain types of analysis tasks in these domains involve finding an optimal solution. For example, in a network configuration problem, instead of finding any valid configuration, it may be desirable to find one that is most permissive (i.e., it permits a maximum number of packets). Due to its dependence on SAT, however, Alloy cannot be used to specify and analyze these types of problems.

We propose AlloyMax, an extension of Alloy with a capability to express and analyze problems with optimal solutions. AlloyMax introduces (1) a small addition of language constructs that can be used to specify a wide range of problems that involve optimality and (2) a new analysis engine that leverages a Maximum Satisfiability (MaxSAT) solver to generate optimal solutions. To enable this new type of analysis, we show how a specification in a first-order relational logic can be translated into an input format of MaxSAT solvers—namely, a Boolean formula in weighted conjunctive normal form (WCNF). We demonstrate the applicability and scalability of AlloyMax on a benchmark of problems. To our knowledge, AlloyMax is the first approach to enable analysis with optimality in a relational modeling language, and we believe that AlloyMax has the potential to bring a wide range of new applications to Alloy.

Timely and accurate detection of model deviation in self-adaptive software-intensive systems

Control-based approaches to self-adaptive software-intensive systems (SASs) are hailed for their optimal performance and theoretical guarantees on the reliability of adaptation behavior. However, in practice the guarantees are often threatened by model deviations occurred at runtime. In this paper, we propose a Model-guided Deviation Detector (MoD2) for timely and accurate detection of model deviations. To ensure reliability, a SAS can switch a control-based optimal controller for a mandatory controller once an unsafe model deviation is detected. MoD2 achieves both high timeliness and high accuracy through a deliberate fusion of parameter deviation estimation, uncertainty compensation, and safe region quantification. Empirical evaluation with three exemplar systems validated the efficacy of MoD2 (93.3% shorter detection delay, 39.4% lower FN rate, and 25.2% lower FP rate), as well as the benefits of the adaptation-switching mechanism (abnormal rate dropped by 29.2%).

SESSION: Dependability

Lightweight and modular resource leak verification

A resource leak occurs when a program allocates a resource, such as a socket or file handle, but fails to deallocate it. Resource leaks cause resource starvation, slowdowns, and crashes. Previous techniques to prevent resource leaks are either unsound, imprecise, inapplicable to existing code, slow, or a combination of these.

Static detection of resource leaks requires checking that de-allocation methods are always invoked on relevant objects before they become unreachable. Our key insight is that leak detection can be reduced to an accumulation problem, a class of typestate problems amenable to sound and modular checking without the need for a heavyweight, whole-program alias analysis. The precision of an accumulation analysis can be improved by computing targeted aliasing information, and we augmented our baseline checker with three such novel techniques: a lightweight ownership transfer system; a specialized resource alias analysis; and a system to create a fresh obligation when a non-final resource field is updated.

Our approach occupies a unique slice of the design space: it is sound and runs relatively quickly (taking minutes on programs that a state-of-the-art approach took hours to analyze). We implemented our techniques for Java in an open-source tool called the Resource Leak Checker. The Resource Leak Checker revealed 49 real resource leaks in widely-deployed software. It scales well, has a manageable false positive rate (comparable to the high-confidence resource leak analysis built into the Eclipse IDE), and imposes only a small annotation burden (1/1500 LoC) for developers.

JSISOLATE: lightweight in-browser JavaScript isolation

Modern web applications commonly include third-party scripts from external hosts. While enabling code reuse and enhancing the functionalities, the reliability of client-side JavaScript code can be impaired by the inclusion of other scripts. Since all scripts run in the same execution environment in the browser, executing them all together may cause unexpected effects. For example, global variables with the same name might be defined by multiple scripts, causing the actual value to be unpredictable.

In this paper, we design a lightweight browser-based framework, JSIsolate, that provides an isolated and reliable JavaScript execution environment. JSIsolate injects scripts into different isolated environments based on their dependency relationship. In this way, it executes scripts with independent functionalities in different contexts, effectively preventing them from interfering with each other. We further evaluated the compatibility and performance overhead of JSIsolate on Alexa top 1K websites, and showed that it can efficiently isolate scripts while preserving the functionalities.

SESSION: Code Recommendation

Cross-language code search using static and dynamic analyses

As code search permeates most activities in software development,code-to-code search has emerged to support using code as a query and retrieving similar code in the search results. Applications include duplicate code detection for refactoring, patch identification for program repair, and language translation. Existing code-to-code search tools rely on static similarity approaches such as the comparison of tokens and abstract syntax trees (AST) to approximate dynamic behavior, leading to low precision. Most tools do not support cross-language code-to-code search, and those that do, rely on machine learning models that require labeled training data. We present Code-to-Code Search Across Languages (COSAL), a cross-language technique that uses both static and dynamic analyses to identify similar code and does not require a machine learning model. Code snippets are ranked using non-dominated sorting based on code token similarity, structural similarity, and behavioral similarity. We empirically evaluate COSAL on two datasets of 43,146Java and Python files and 55,499 Java files and find that 1) code search based on non-dominated ranking of static and dynamic similarity measures is more effective compared to single or weighted measures; and 2) COSAL has better precision and recall compared to state-of-the-art within-language and cross-language code-to-code search tools. We explore the potential for using COSAL on large open-source repositories and discuss scalability to more languages and similarity metrics, providing a gateway for practical,multi-language code-to-code search.

Automating the removal of obsolete TODO comments

TODO comments are very widely used by software developers to describe their pending tasks during software development. However, after performing the task developers sometimes neglect or simply forget to remove the TODO comment, resulting in obsolete TODO comments. These obsolete TODO comments can confuse development teams and may cause the introduction of bugs in the future, decreasing the software's quality and maintainability. Manually identifying obsolete TODO comments is time-consuming and expensive. It is thus necessary to detect obsolete TODO comments and remove them automatically before they cause any unwanted side effects. In this work, we propose a novel model, named TDCleaner, to identify obsolete TODO comments in software projects. TDCleaner can assist developers in just-in-time checking of TODO comments status and avoid leaving obsolete TODO comments. Our approach has two main stages: offline learning and online prediction. During offline learning, we first automatically establish <code_change, todo_comment, commit_msg> training samples and leverage three neural encoders to capture the semantic features of TODO comment, code change and commit message respectively. TDCleaner then automatically learns the correlations and interactions between different encoders to estimate the final status of the TODO comment. For online prediction, we check a TODO comment's status by leveraging the offline trained model to judge the TODO comment's likelihood of being obsolete. We built our dataset by collecting TODO comments from the top-10,000 Python and Java Github repositories and evaluated TDCleaner on them. Extensive experimental results show the promising performance of our model over a set of benchmarks. We also performed an in-the-wild evaluation with real-world software projects, we reported 18 obsolete TODO comments identified by TDCleaner to Github developers and 9 of them have already been confirmed and removed by the developers, demonstrating the practical usage of our approach.

SESSION: Fuzzing

Estimating residual risk in greybox fuzzing

For any errorless fuzzing campaign, no matter how long, there is always some residual risk that a software error would be discovered if only the campaign was run for just a bit longer. Recently, greybox fuzzing tools have found widespread adoption. Yet, practitioners can only guess when the residual risk of a greybox fuzzing campaign falls below a specific, maximum allowable threshold.

In this paper, we explain why residual risk cannot be directly estimated for greybox campaigns, argue that the discovery probability (i.e., the probability that the next generated input increases code coverage) provides an excellent upper bound, and explore sound statistical methods to estimate the discovery probability in an ongoing greybox campaign. We find that estimators for blackbox fuzzing systematically and substantially under-estimate the true risk. An engineer—who stops the campaign when the estimators purport a risk below the maximum allowable risk—is vastly misled. She might need execute a campaign that is orders of magnitude longer to achieve the allowable risk. Hence, the key challenge we address in this paper is adaptive bias: The probability to discover a specific error actually increases over time. We provide the first probabilistic analysis of adaptive bias, and introduce two novel classes of estimators that tackle adaptive bias. With our estimators, the engineer can decide with confidence when to abort the campaign.

HeteroFuzz: fuzz testing to detect platform dependent divergence for heterogeneous applications

As specialized hardware accelerators like FPGAs become a prominent part of the current computing landscape, software applications are increasingly constructed to leverage heterogeneous architectures. Such a trend is already happening in the domain of machine learning and Internet-of-Things (IoT) systems built on edge devices. Yet, debugging and testing methods for heterogeneous applications are currently lacking. These applications may look similar to regular C/C++ code but include hardware synthesis details in terms of preprocessor directives. Therefore, their behavior under heterogeneous architectures may diverge significantly from CPU due to hardware synthesis details. Further, the compilation and hardware simulation cycle takes an enormous amount of time, prohibiting frequent invocations required for fuzz testing.

We propose a novel fuzz testing technique, called HeteroFuzz, designed to specifically target heterogeneous applications and to detect platform-dependent divergence. The key essence of HeteroFuzz is that it uses a three-pronged approach to reduce the long latency of repetitively invoking a hardware simulator on a heterogeneous application. First, in addition to monitoring code coverage as a fuzzing guidance mechanism, we analyze synthesis pragmas in kernel code and monitor accelerator-relevant value spectra. Second, we design dynamic probabilistic mutations to increase the chance of hitting divergent behavior under different platforms. Third, we memorize the boundaries of seen kernel inputs and skip HLS simulator invocation if it can expose only redundant divergent behavior. We evaluate HeteroFuzz on seven real-world heterogeneous applications with FPGA kernels. HeteroFuzz is 754X faster in exposing the same set of distinct divergence symptoms than naive fuzzing. Probabilistic mutations contribute to 17.5X speed up than the one without. Selective invocation of HLS simulation contributes to 8.8X speed up than the one without.

SESSION: Defect Prediction and Effort Estimation

Sound and efficient concurrency bug prediction

Concurrency bugs are extremely difficult to detect. Recently, several dynamic techniques achieve sound analysis. M2 is even complete for two threads. It is designed to decide whether two events can occur consecutively. However, real-world concurrency bugs can involve more events and threads. Some can occur when the order of two or more events can be exchanged even if they occur not consecutively. We propose a new technique SeqCheck to soundly decide whether a sequence of events can occur in a specified order. The ordered sequence represents a potential concurrency bug. And several known forms of concurrency bugs can be easily encoded into event sequences where each represents a way that the bug can occur. To achieve it, SeqCheck explicitly analyzes branch events and includes a set of efficient algorithms. We show that SeqCheck is sound; and it is also complete on traces of two threads.

We have implemented SeqCheck to detect three types of concurrency bugs and evaluated it on 51 Java benchmarks producing up to billions of events. Compared with M2 and other three recent sound race detectors, SeqCheck detected 333 races in ~30 minutes; while others detected from 130 to 285 races in ~6 to ~12 hours. SeqCheck detected 20 deadlocks in ~6 seconds. This is only one less than Dirk; but Dirk spent more than one hour. SeqCheck also detected 30 atomicity violations in ~20 minutes. The evaluation shows SeqCheck can significantly outperform existing concurrency bug detectors.

SESSION: Vulnerabilities

Detecting Node.js prototype pollution vulnerabilities via object lookup analysis

Prototype pollution is a type of vulnerability specific to prototype-based languages, such as JavaScript, which allows an adversary to pollute a base object’s property, leading to a further consequence such as Denial of Service (DoS), arbitrary code execution, and session fixation. On one hand, the only prior work in detecting prototype pollution adopts dynamic analysis to fuzz package inputs, which inevitably has code coverage issues in triggering some deeply embedded vulnerabilities. On the other hand, it is challenging to apply state-of-the-art static analysis in detecting prototype pollution because of the involvement of prototype chains and fine-grained object relations including built-in ones.

In this paper, we propose a flow-, context-, and branch-sensitive static taint analysis tool, called ObjLupAnsys, to detect prototype pollution vulnerabilities. The key of ObjLupAnsys is a so-called object lookup analysis, which gradually expands the source and sink objects into big clusters with a complex inner structure by performing targeted object lookups in both clusters so that a system built-in function can be redefined. Specifically, at the source cluster, ObjLupAnsys proactively creates new object properties based on how the target program uses the initial source object; at the sink cluster, ObjLupAnsys assigns property values in object lookups to decrease the number of object lookups to reach a system built-in function.

We implemented an open-source tool and applied it for the detection of prototype pollution among Node.js packages. Our evaluation shows that ObjLupAnsys finds 61 zero-day, previously-unknown, exploitable vulnerabilities as opposed to 18 by the state-of-the-art dynamic fuzzing tool and three by a state-of-the-art static analysis tool that is modified to detect prototype pollution. To date, 11 vulnerable Node.js packages are assigned with CVE numbers and five have already been patched by their developers. In addition, ObjLupAnsys also discovered seven applications or packages including a real-world, online website, which are indirectly vulnerable due to the inclusion of vulnerable packages found by ObjLupAnsys.

Detecting concurrency vulnerabilities based on partial orders of memory and thread events

Memory vulnerabilities are the main causes of software security problems. However, detecting vulnerabilities in multi-threaded programs is challenging because many vulnerabilities occur under specific executions, and it is hard to explore all possible executions of a multi-threaded program. Existing approaches are either computationally intensive or likely to miss some vulnerabilities due to the complex thread interleaving. This paper introduces a novel approach to detect concurrency memory vulnerabilities based on partial orders of events. A partial order on a set of events represents the definite execution orders of events. It allows constructing feasible traces exposing specific vulnerabilities by exchanging the execution orders of vulnerability-potential events. It also reduces the search space of possible executions and thus improves computational efficiency. We propose new algorithms to extract vulnerability-potential event pairs for three kinds of memory vulnerabilities. We also design a novel algorithm to compute a potential event pair's feasible set, which contains the relevant events required by a feasible trace. Our method extends existing approaches for data race detection by considering that two events are protected by the same lock. We implement a prototype of our approach and conduct experiments to evaluate its performance. Experimental results show that our tool exhibits superiority over state-of-the-art algorithms in both effectiveness and efficiency.

Vulnerability detection with fine-grained interpretations

Despite the successes of machine learning (ML) and deep learning (DL)-based vulnerability detectors (VD), they are limited to providing only the decision on whether a given code is vulnerable or not, without details on what part of the code is relevant to the detected vulnerability. We present IVDetect, an interpretable vulnerability detector with the philosophy of using Artificial Intelligence (AI) to detect vulnerabilities, while using Intelligence Assistant (IA) to provide VD interpretations in terms of vulnerable statements.

For vulnerability detection, we separately consider the vulnerable statements and their surrounding contexts via data and control dependencies. This allows our model better discriminate vulnerable statements than using the mixture of vulnerable code and contextual code as in existing approaches. In addition to the coarse-grained vulnerability detection result, we leverage interpretable AI to provide users with fine-grained interpretations that include the sub-graph in the Program Dependency Graph (PDG) with the crucial statements that are relevant to the detected vulnerability. Our empirical evaluation on vulnerability databases shows that IVDetect outperforms the existing DL-based approaches by 43%–84% and 105%–255% in top-10 nDCG and MAP ranking scores. IVDetect correctly points out the vulnerable statements relevant to the vulnerability via its interpretation in 67% of the cases with a top-5 ranked list. IVDetect improves over the baseline interpretation models by 12.3%–400% and 9%–400% in accuracy.

Identifying casualty changes in software patches

Noise in software patches impacts their understanding, analysis, and use for tasks such as change prediction. Although several approaches have been developed to identify noise in patches, this issue has persisted. An analysis of a dataset of security patches for the Tomcat web server, which we further expanded with security patches from five additional systems, uncovered several kinds of previously unreported noise which we call nonessential casualty changes. These are changes that themselves do not alter the logic of the program but are necessitated by other changes made in the patch. In this paper, we provide a comprehensive taxonomy of casualty changes. We then develop CasCADe, an automated technique for automatically identifying casualty changes. We evaluate CasCADe with several publicly available datasets of patches and tools that focus on them. Our results show that CasCADe is highly accurate, that the kinds of noise it identifies occur relatively commonly in patches, and that removing this noise improves upon the evaluation results of a previously published change-based approach.

ACHyb: a hybrid analysis approach to detect kernel access control vulnerabilities

Access control is essential for the Operating System (OS) security. Incorrect implementation of access control can introduce new attack surfaces to the OS, known as Kernel Access Control Vulnerabilities (KACVs). To understand KACVs, we conduct our study on the root causes and the security impacts of KACVs. Regarding the complexity of the recognized root causes, we particularly focus on two kinds of KACVs, namely KACV-M (due to missing permission checks) and KACV-I (due to misusing permission checks). We find that over 60% of these KACVs are of critical, high or medium security severity, resulting in a variety of security threats including bypass security checking, privileged escalation, etc. However, existing approaches can only detect KACV-M. The state-of-the-art KACV-M detector called PeX is a static analysis tool, which still suffers from extremely high false-positive rates.

In this paper, we present ACHyb, a precise and scalable approach to reveal both KACV-M and KACV-I. ACHyb is a hybrid approach, which first applies static analysis to identify the potentially vulnerable paths and then applies dynamic analysis to further reduce the false positives of the paths. For the static analysis, ACHyb improves PeX in both the precision and the soundness, using the interface analysis, callsite dependence analysis and constraint-based invariant analysis with a stronger access control invariant. For the dynamic analysis, ACHyb utilizes the greybox fuzzing to identify the potential KACVs. In order to improve the fuzzing efficiency, ACHyb adopts our novel clustering-based seed distillation approach to generate high-quality seed programs. Our experimental results show that ACHyb reveals 76 potential KACVs in less than 8 hours and 22 of them are KACVs (19 KACV-M and 3 KACV-I). In contrast, PeX reveals 2,088 potential KACVs in more than 11 hours, and only 14 of them are KACVs (all KACV-M). Furthermore, ACHyb successfully uncovers 7 new KACVs, and 2 of them (1 KACV-M and 1 KACV-I) have been confirmed by kernel developers.

SESSION: Program Repair

Context-aware and data-driven feedback generation for programming assignments

Recently, various techniques have been proposed to automatically provide personalized feedback on programming exercises. The cutting edge of which is the data-driven approaches that leverage a corpus of existing correct programs and repair incorrect submissions by using similar reference programs in the corpus. However, current data-driven techniques work under the strong assumption that the corpus contains a solution program that is close enough to the incorrect submission. In this paper, we present Cafe, a new data-driven approach for feedback generation that overcomes this limitation. Unlike existing approaches, Cafe uses a novel context-aware repair algorithm that can generate feedback even if the incorrect program differs significantly from the reference solutions. We implemented Cafe for OCaml and evaluated it with 4,211 real student programs. The results show that Cafe is able to repair 83 % of incorrect submissions, far outperforming existing approaches.

A syntax-guided edit decoder for neural program repair

Automated Program Repair (APR) helps improve the efficiency of software development and maintenance. Recent APR techniques use deep learning, particularly the encoder-decoder architecture, to generate patches. Though existing DL-based APR approaches have proposed different encoder architectures, the decoder remains to be the standard one, which generates a sequence of tokens one by one to replace the faulty statement. This decoder has multiple limitations: 1) allowing to generate syntactically incorrect programs, 2) inefficiently representing small edits, and 3) not being able to generate project-specific identifiers.

In this paper, we propose Recoder, a syntax-guided edit decoder with placeholder generation. Recoder is novel in multiple aspects: 1) Recoder generates edits rather than modified code, allowing efficient representation of small edits; 2) Recoder is syntax-guided, with the novel provider/decider architecture to ensure the syntactic correctness of the patched program and accurate generation; 3) Recoder generates placeholders that could be instantiated as project-specific identifiers later.

We conduct experiments to evaluate Recoder on 395 bugs from Defects4J v1.2, 420 additional bugs from Defects4J v2.0, 297 bugs from IntroClassJava and 40 bugs from QuixBugs. Our results show that Recoder repairs 53 bugs on Defects4J v1.2, which achieves 26.2% (11 bugs) improvement over the previous state-of-the-art approach for single-hunk bugs (TBar). Importantly, to our knowledge, Recoder is the first DL-based APR approach that has outperformed the traditional APR approaches on this benchmark. Furthermore, Recoder repairs 19 bugs on the additional bugs from Defects4J v2.0, which is 137.5% (11 bugs) more than TBar and 850% (17 bugs) more than SimFix. Recoder also achieves 775% (31 bugs) and 30.8% (4 bugs) improvement on IntroClassJava and QuixBugs over the baselines respectively. These results suggest that Recoder has better generalizability than existing APR approaches.

VarFix: balancing edit expressiveness and search effectiveness in automated program repair

Automatically repairing a buggy program is essentially a search problem, searching for code transformations that pass a set of tests. Various search strategies have been explored, but they either navigate the search space in an ad hoc way using heuristics, or systemically but at the cost of limited edit expressiveness in the kinds of supported program edits. In this work, we explore the possibility of systematically navigating the search space without sacrificing edit expressiveness. The key enabler of this exploration is variational execution, a dynamic analysis technique that has been shown to be effective at exploring many similar executions in large search spaces. We evaluate our approach on IntroClassJava and Defects4J, showing that a systematic search is effective at leveraging and combining fixing ingredients to find patches, including many high-quality patches and multi-edit patches.

SESSION: Flaky Tests

Flaky test detection in Android via event order exploration

Validation of Android apps via testing is difficult owing to the presence of flaky tests. Due to non-deterministic execution environments, a sequence of events (a test) may lead to success or failure in unpredictable ways. In this work, we present an approach and tool FlakeScanner for detecting flaky tests through exploration of event orders. Our key observation is that for a test in a mobile app, there is a testing framework thread which creates the test events, a main User-Interface (UI) thread processing these events, and there may be several other background threads running asynchronously. For any event e whose execution involves potential non-determinism, we localize the earliest (latest) event after (before) which e must happen. We then efficiently explore the schedules between the upper/lower bound events while grouping events within a single statement, to find whether the test outcome is flaky. We also create a suite of subject programs called FlakyAppRepo (containing 33 widely-used Android projects) to study flaky tests in Android apps. Our experiments on the subject-suite FlakyAppRepo show FlakeScanner detected 45 out of 52 known flaky tests as well as 245 previously unknown flaky tests among 1444 tests.

SESSION: Collaborative Software Engineering

SmartCommit: a graph-based interactive assistant for activity-oriented commits

In collaborative software development, it is considered to be a best practice to submit code changes as a sequence of cohesive commits, each of which records the work result of a specific development activity, such as adding a new feature, bug fixing, and refactoring. However, rather than following this best practice, developers often submit a set of loosely-related changes serving for different development activities as a composite commit, due to the tedious manual work and lack of effective tool support to decompose such a tangled changeset. Composite commits often obfuscate the change history of software artifacts and bring challenges to efficient collaboration among developers. To encourage activity-oriented commits, we propose SmartCommit, a graph-partitioning-based interactive approach to tangled changeset decomposition that leverages not only the efficiency of algorithms but also the knowledge of developers. To evaluate the effectiveness of our approach, we (1) deployed SmartCommit in an international IT company, and analyzed usage data collected from a field study with 83 engineers over 9 months; and (2) conducted a controlled experiment on 3,000 synthetic composite commits from 10 diverse open-source projects. Results show that SmartCommit achieves a median accuracy between 71–84% when decomposing composite commits without developer involvement, and significantly helps developers follow the best practice of submitting activity-oriented commits with acceptable interaction effort and time cost in real collaborative software development.

A first look at developers’ live chat on Gitter

Modern communication platforms such as Gitter and Slack play an increasingly critical role in supporting software teamwork, especially in open source development.Conversations on such platforms often contain intensive, valuable information that may be used for better understanding OSS developer communication and collaboration. However, little work has been done in this regard. To bridge the gap, this paper reports a first comprehensive empirical study on developers' live chat, investigating when they interact, what community structures look like, which topics are discussed, and how they interact. We manually analyze 749 dialogs in the first phase, followed by an automated analysis of over 173K dialogs in the second phase. We find that developers tend to converse more often on weekdays, especially on Wednesdays and Thursdays (UTC), that there are three common community structures observed, that developers tend to discuss topics such as API usages and errors, and that six dialog interaction patterns are identified in the live chat communities. Based on the findings, we provide recommendations for individual developers and OSS communities, highlight desired features for platform vendors, and shed light on future research directions. We believe that the findings and insights will enable a better understanding of developers' live chat, pave the way for other researchers, as well as a better utilization and mining of knowledge embedded in the massive chat history.

Reel life vs. real life: how software developers share their daily life through vlogs

Software developers are turning to vlogs (video blogs) to share what a day is like to walk in their shoes. Through these vlogs developers share a rich perspective of their technical work as well their personal lives. However, does the type of activities portrayed in vlogs differ from activities developers in the industry perform? Would developers at a software company prefer to show activities to different extents if they were asked to share about their day through vlogs? To answer these questions, we analyzed 130 vlogs by software developers on YouTube and conducted a survey with 335 software developers at a large software company. We found that although vlogs present traditional development activities such as coding and code peripheral activities (11%), they also prominently feature wellness and lifestyle related activities (47.3%) that have not been reflected in previous software engineering literature. We also found that developers at the software company were inclined to share more non-coding tasks (e.g., personal projects, time spent with family and friends, and health) when asked to create a mock-up vlog to promote diversity. These findings demonstrate a shift in our understanding of how software developers are spending their time and find valuable to share publicly. We discuss how vlogs provide a more complete perspective of software development work and serve as a valuable source of data for empirical research.

SESSION: Cloud Computing

An empirical study on challenges of application development in serverless computing

Serverless computing is an emerging paradigm for cloud computing, gaining traction in a wide range of applications such as video processing and machine learning. This new paradigm allows developers to focus on the development of the logic of serverless computing based applications (abbreviated as serverless-based applications) in the granularity of function, thereby freeing developers from tedious and error-prone infrastructure management. Meanwhile, it also introduces new challenges on the design, implementation, and deployment of serverless-based applications, and current serverless computing platforms are far away from satisfactory. However, to the best of our knowledge, these challenges have not been well studied. To fill this knowledge gap, this paper presents the first comprehensive study on understanding the challenges in developing serverless-based applications from the developers’ perspective. We mine and analyze 22,731 relevant questions from Stack Overflow (a popular Q&A website for developers), and show the increasing popularity trend and the high difficulty level of serverless computing for developers. Through manual inspection of 619 sampled questions, we construct a taxonomy of challenges that developers encounter, and report a series of findings and actionable implications. Stakeholders including application developers, researchers, and cloud providers can leverage these findings and implications to better understand and further explore the serverless computing paradigm.

SESSION: Search Based Software Engineering

Bias in machine learning software: why? how? what to do?

Increasingly, software is making autonomous decisions in case of criminal sentencing, approving credit cards, hiring employees, and so on. Some of these decisions show bias and adversely affect certain social groups (e.g. those defined by sex, race, age, marital status). Many prior works on bias mitigation take the following form: change the data or learners in multiple ways, then see if any of that improves fairness. Perhaps a better approach is to postulate root causes of bias and then applying some resolution strategy. This paper postulates that the root causes of bias are the prior decisions that affect- (a) what data was selected and (b) the labels assigned to those examples. Our Fair-SMOTE algorithm removes biased labels; and rebalances internal distributions such that based on sensitive attribute, examples are equal in both positive and negative classes. On testing, it was seen that this method was just as effective at reducing bias as prior approaches. Further, models generated via Fair-SMOTE achieve higher performance (measured in terms of recall and F1) than other state-of-the-art fairness improvement algorithms. To the best of our knowledge, measured in terms of number of analyzed learners and datasets, this study is one of the largest studies on bias mitigation yet presented in the literature.

Understanding neural code intelligence through program simplification

A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior.

Multi-objectivizing software configuration tuning

Automatically tuning software configuration for optimizing a single performance attribute (e.g., minimizing latency) is not trivial, due to the nature of the configuration systems (e.g., complex landscape and expensive measurement). To deal with the problem, existing work has been focusing on developing various effective optimizers. However, a prominent issue that all these optimizers need to take care of is how to avoid the search being trapped in local optima — a hard nut to crack for software configuration tuning due to its rugged and sparse landscape, and neighboring configurations tending to behave very differently. Overcoming such in an expensive measurement setting is even more challenging. In this paper, we take a different perspective to tackle this issue. Instead of focusing on improving the optimizer, we work on the level of optimization model. We do this by proposing a meta multi-objectivization model (MMO) that considers an auxiliary performance objective (e.g., throughput in addition to latency). What makes this model unique is that we do not optimize the auxiliary performance objective, but rather use it to make similarly-performing while different configurations less comparable (i.e. Pareto nondominated to each other), thus preventing the search from being trapped in local optima.

Experiments on eight real-world software systems/environments with diverse performance attributes reveal that our MMO model is statistically more effective than state-of-the-art single-objective counterparts in overcoming local optima (up to 42% gain), while using as low as 24% of their measurements to achieve the same (or better) performance result.

SESSION: Libraries and APIs

Embedding app-library graph for neural third party library recommendation

The mobile app marketplace has fierce competition for mobile app developers, who need to develop and update their apps as soon as possible to gain first mover advantage. Third-party libraries (TPLs) offer developers an easier way to enhance their apps with new features. However, how to find suitable candidates among the high number and fast-changing TPLs is a challenging problem. TPL recommendation is a promising solution, but unfortunately existing approaches suffer from low accuracy in recommendation results. To tackle this challenge, we propose GRec, a graph neural network (GNN) based approach, for recommending potentially useful TPLs for app development. GRec models mobile apps, TPLs, and their interactions into an app-library graph. It then distills app-library interaction information from the app-library graph to make more accurate TPL recommendations. To evaluate GRec’s performance, we conduct comprehensive experiments based on a large-scale real-world Android app dataset containing 31,432 Android apps, 752 distinct TPLs, and 537,011 app-library usage records. Our experimental results illustrate that GRec can significantly increase the prediction accuracy and diversify the prediction results compared with state-of-the-art methods. A user study performed with app developers also confirms GRec's usefulness for real-world mobile app development.

A large-scale empirical study on Java library migrations: prevalence, trends, and rationales

With the rise of open-source software and package hosting platforms, reusing 3rd-party libraries has become a common practice. Due to various failures during software evolution, a project may remove a used library and replace it with another library, which we call library migration. Despite substantial research on dependency management, the understanding of how and why library migrations occur is still lacking. Achieving this understanding may help practitioners optimize their library selection criteria, develop automated approaches to monitor dependencies, and provide migration suggestions for their libraries or software projects. In this paper, through a fine-grained commit-level analysis of 19,652 Java GitHub projects, we extract the largest migration dataset to-date (1,194 migration rules, 3,163 migration commits). We show that 8,065 (41.04%) projects having at least one library removal, 1,564 (7.96%, lower-bound) to 5,004 (25.46%, upper-bound) projects have at least one migration, and a median project with migrations has 2 to 4 migrations in total. We discover that library migrations are dominated by several domains (logging, JSON, testing and web service) presenting a long tail distribution. Also, migrations are highly unidirectional in that libraries are either mostly abandoned or mostly chosen in our project corpus. A thematic analysis on related commit messages, issues, and pull requests identifies 14 frequently mentioned migration reasons (e.g., lack of maintenance, usability, integration, etc), 7 of which are not discussed in previous work. Our findings can be operationalized into actionable insights for package hosting platforms, project maintainers, and library developers. We provide a replication package at <a>https://doi.org/10.5281/zenodo.4816752</a>.

Learning-based extraction of first-order logic representations of API directives

Developers often rely on API documentation to learn API directives, i.e., constraints and guidelines related to API usage. Failing to follow API directives may cause defects or improper implementations. Since there are no industry-wide standards on how to document API directives, they take many forms and are often hard to understand by developers or challenging to parse with tools.

In this paper, we propose a learning based approach for extracting first-order logic representations of API directives (FOL directives for short). The approach, called LEADFOL, uses a joint learning method to extract atomic formulas by identifying the predicates and arguments involved in directive sentences, and recognizes the logical relations between atomic formulas, by parsing the sentence structures. It then parses the arguments and uses a learning based method to link API references to their corresponding API elements. Finally, it groups the formulas of the same class or method together and transforms them into conjunctive normal form. Our evaluation shows that LEADFOL can accurately extract more FOL directives than a state-of-the-art approach and that the extracted FOL directives are useful in supporting code reviews.

SESSION: Development Tools

DIFFBASE: a differential factbase for effective software evolution management

Numerous tools and techniques have been developed to extract and analyze information from software development artifacts. Yet, there is a lack of effective method to process, store, and exchange information among different analyses. In this paper, we propose differential factbase, a uniform exchangeable representation supporting efficient querying and manipulation, based on the existing concept of program facts. We consider program changes as first-class objects, which establish links between intra-version facts of single program snapshots and provide insights on how certain artifacts evolve over time via inter-version facts. We implement a series of differential fact extractors supporting different programming languages and platforms, and demonstrate with usage scenarios the benefits of adopting differential facts in supporting software evolution management.

Would you like a quick peek? providing logging support to monitor data processing in big data applications

To analyze large-scale data efficiently, developers have created various big data processing frameworks (e.g., Apache Spark). These big data processing frameworks provide abstractions to developers so that they can focus on implementing the data analysis logic. In traditional software systems, developers leverage logging to monitor applications and record intermediate states to assist workload understanding and issue diagnosis. However, due to the abstraction and the peculiarity of big data frameworks, there is currently no effective monitoring approach for big data applications. In this paper, we first manually study 1,000 randomly sampled Spark-related questions on Stack Overflow to study their root causes and the type of information, if recorded, that can assist developers with motioning and diagnosis. Then, we design an approach, DPLOG, which assists developers with monitoring Spark applications. DPLOG leverages statistical sampling to minimize performance overhead and provides intermediate information and hint/warning messages for each data processing step of a chained method pipeline. We evaluate DPLOG on six benchmarking programs and find that DPLOG has a relatively small overhead (i.e., less than 10% increase in response time when processing 5GB data) compared to without using DPLOG, and reduce the overhead by over 500% compared to the baseline. Our user study with 20 developers shows that DPLOG can reduce the needed time to debug big data applications by 63% and the participants give DPLOG an average of 4.85/5 for its usefulness. The idea of DPLOG may be applied to other big data processing frameworks, and our study sheds light on future research opportunities in assisting developers with monitoring big data applications.

SESSION: Code Review and Changes

Identifying bad software changes via multimodal anomaly detection for online service systems

In large-scale online service systems, software changes are inevitable and frequent. Due to importing new code or configurations, changes are likely to incur incidents and destroy user experience. Thus it is essential for engineers to identify bad software changes, so as to reduce the influence of incidents and improve system re- liability. To better understand bad software changes, we perform the first empirical study based on large-scale real-world data from a large commercial bank. Our quantitative analyses indicate that about 50.4% of incidents are caused by bad changes, mainly be- cause of code defect, configuration error, resource contention, and software version. Besides, our qualitative analyses show that the current practice of detecting bad software changes performs not well to handle heterogeneous multi-source data involved in soft- ware changes. Based on the findings and motivation obtained from the empirical study, we propose a novel approach named SCWarn aiming to identify bad changes and produce interpretable alerts accurately and timely. The key idea of SCWarn is drawing support from multimodal learning to identify anomalies from heterogeneous multi-source data. An extensive study on two datasets with various bad software changes demonstrates our approach significantly outperforms all the compared approaches, achieving 0.95 F1-score on average and reducing MTTD (mean time to detect) by 20.4%∼60.7%. In particular, we shared some success stories and lessons learned from the practical usage.

SESSION: Refactoring

An automatic refactoring framework for replacing test-production inheritance by mocking mechanism

Unit testing focuses on verifying the functions of individual units of a software system. It is challenging due to the high inter-dependencies among software units. Developers address this by mocking-replacing the dependency by a "faked" object. Despite the existence of powerful, dedicated mocking frameworks, developers often turn to a "hand-rolled" approach-inheritance. That is, they create a subclass of the dependent class and mock its behavior through method overriding. However, this requires tedious implementation and compromises the design quality of unit tests. This work contributes a fully automated refactoring framework to identify and replace the usage of inheritance by using Mockito-a well received mocking framework. Our approach is built upon the empirical experience from five open source projects that use inheritance for mocking. We evaluate our approach on four other projects. Results show that our framework is efficient, generally applicable to new datasets, mostly preserves test case behaviors in detecting defects (in the form of mutants), and decouples test code from production code. The qualitative evaluation by experienced developers suggests that the auto-refactoring solutions generated by our framework improve the quality of the unit test cases in various aspects, such as making test conditions more explicit, as well as improved cohesion, readability, understandability, and maintainability with test cases.

SESSION: Blockchain

ÐArcher: detecting on-chain-off-chain synchronization bugs in decentralized applications

Since the emergence of Ethereum, blockchain-based decentralized applications (DApps) have become increasingly popular and important. To balance the security, performance, and costs, a DApp typically consists of two layers: an on-chain layer to execute transactions and store crucial data on the blockchain and an off-chain layer to interact with users. A DApp needs to synchronize its off-chain layer with the on-chain layer proactively. Otherwise, the inconsistent data in the off-chain layer could mislead users and cause undesirable consequences, e.g., loss of transaction fees. However, transactions sent to the blockchain are not guaranteed to be executed and could even be reversed after execution due to chain reorganization. Such non-determinism in the transaction execution is unique to blockchain. DApp developers may fail to perform the on-chain-off-chain synchronization accurately due to their lack of familiarity with the complex transaction lifecycle. In this work, we investigate the challenges of synchronizing on-chain and off-chain data in Ethereum-based DApps. We present two types of bugs that could result in inconsistencies between the on-chain and off-chain layers. To help detect such on-chain-off-chain synchronization bugs, we introduce a state transition model to guide the testing of DApps and propose two effective oracles to facilitate the automatic identification of bugs. We build the first testing framework, ÐArcher, to detect on-chain-off-chain synchronization bugs in DApps. We have evaluated ÐArcher on 11 popular real-world DApps. ÐArcher achieves high precision (99.3%), recall (87.6%), and accuracy (89.4%) in bug detection and significantly outperforms the baseline methods. It has found 15 real bugs in the 11 DApps. So far, six of the 15 bugs have been confirmed by the developers, and three have been fixed. These promising results demonstrate the usefulness of ÐArcher.

iBatch: saving Ethereum fees via secure and cost-effective batching of smart-contract invocations

This paper presents iBatch, a middleware system running on top of an operational Ethereum network to enable secure batching of smart-contract invocations against an untrusted relay server off-chain. iBatch does so at a low overhead by validating the server's batched invocations in smart contracts without additional states. The iBatch mechanism supports a variety of policies, ranging from conservative to aggressive batching, and can be configured adaptively to the current workloads. iBatch automatically rewrites smart contracts to integrate with legacy applications and support large-scale deployment.

For cost evaluation, we develop a platform with fast and cost-accurate transaction replaying, build real transaction benchmarks on popular Ethereum applications, and build a functional prototype of iBatch on Ethereum. The evaluation results show that iBatch saves 14.6%-59.1% Gas cost per invocation with a moderate 2-minute delay and 19.06%-31.52% Ether cost per invocation with a delay of 0.26-1.66 blocks.

SESSION: Recommender Systems

Which abbreviations should be expanded?

Abbreviations are common in source code. Properly designed abbreviations may significantly facilitate typing, typesetting, and reading of lengthy source code. However, abbreviations, if used improperly, may also significantly reduce the readability and maintainability of source code. Although a few automated approaches have been proposed to suggest full terms for given abbreviations, to the best of our knowledge, there is no automated approaches to suggest whether abbreviations are used properly, i.e., whether they should be replaced with corresponding full terms. Notably, it is often challenging for inexperienced developers and maintainers to make such decisions. To this end, in this paper, we propose an automated approach to assisting developers and maintainers in making the decisions. The rationale of the approach is that abbreviations should not be expanded if the expansion would result in unacceptably lengthy identifiers or if developers/maintainers can easily figure out the meaning (full terms) of the abbreviations based on their domain knowledge or contexts of the abbreviations. From a corpus of programs, we leverage data mining techniques to discover common abbreviations that are frequently employed by various developers in similar contexts. The key of the data mining is to turn the problem of mining common abbreviations into the maximal clique problem that has been extensively studied. We suggest to not expand given abbreviation if it matches at least one of the discovered common abbreviations. From the same corpus, we also calculate the probability distribution for the length of different types of identifier, e.g., variable names and method names. The probability distribution specifies how likely an identifier of type T is composed of exactly n characters. Our heuristic is to not expand the abbreviation if the probability of its enclosing identifier would be reduced by the expansion. Finally, we also suggest to not expand the abbreviation if its full terms are contained in surrounding contexts of the abbreviation, i.e., tokens on the same source code line. Other abbreviations that do not receive suggestions from the proposed approach are expected to be replaced with their full terms. Our evaluation results on 1,818 abbreviations from five open-source applications suggest that the proposed approach is accurate with a high accuracy of 95%.

SESSION: Testing of Machine Learning Models

Validation on machine reading comprehension software without annotated labels: a property-based method

Machine Reading Comprehension (MRC) in Natural Language Processing has seen great progress recently. But almost all the current MRC software is validated with a reference-based method, which requires well-annotated labels for test cases and tests the software by checking the consistency between the labels and the outputs. However, labeling test cases of MRC could be very costly due to their complexity, which makes reference-based validation hard to be extensible and sufficient. Furthermore, solely checking the consistency and measuring the overall score may not be sensible and flexible for assessing the language understanding capability. In this paper, we propose a property-based validation method for MRC software with Metamorphic Testing to supplement the reference-based validation. It does not refer to the labels and hence can make much data available for testing. Besides, it validates MRC software against various linguistic properties to give a specific and in-depth picture on linguistic capabilities of MRC software. Comprehensive experimental results show that our method can successfully reveal violations to the target linguistic properties without the labels. Moreover, it can reveal problems that have been concealed by the traditional validation. Comparison according to the properties provides deeper and more concrete ideas about different language understanding capabilities of the MRC software.

FLEX: fixing flaky tests in machine learning projects by updating assertion bounds

Many machine learning (ML) algorithms are inherently random – multiple executions using the same inputs may produce slightly different results each time. Randomness impacts how developers write tests that check for end-to-end quality of their implementations of these ML algorithms. In particular, selecting the proper thresholds for comparing obtained quality metrics with the reference results is a non-intuitive task, which may lead to flaky test executions.

We present FLEX, the first tool for automatically fixing flaky tests due to algorithmic randomness in ML algorithms. FLEX fixes tests that use approximate assertions to compare actual and expected values that represent the quality of the outputs of ML algorithms. We present a technique for systematically identifying the acceptable bound between the actual and expected output quality that also minimizes flakiness. Our technique is based on the Peak Over Threshold method from statistical Extreme Value Theory, which estimates the tail distribution of the output values observed from several runs. Based on the tail distribution, FLEX updates the bound used in the test, or selects the number of test re-runs, based on a desired confidence level.

We evaluate FLEX on a corpus of 35 tests collected from the latest versions of 21 ML projects. Overall, FLEX identifies and proposes a fix for 28 tests. We sent 19 pull requests, each fixing one test, to the developers. So far, 9 have been accepted by the developers.

SESSION: Analysis and Testing of Unconventional Software

Parallel shadow execution to accelerate the debugging of numerical errors

This paper proposes a new approach for debugging errors in floating point computation by performing shadow execution with higher precision in parallel. The programmer specifies parts of the program that need to be debugged for errors. Our compiler creates shadow execution tasks, which execute on different cores and perform the computation with higher precision. We propose a novel method to execute a shadow execution task from an arbitrary memory state, which is necessary because we are creating a parallel shadow execution from a sequential program. Our approach also ensures that the shadow execution follows the same control flow path as the original program. Our runtime automatically distributes the shadow execution tasks to balance the load on the cores. Our prototype for parallel shadow execution, PFPSanitizer, provides comprehensive detection of errors while having lower performance overheads than prior approaches.

Exposing numerical bugs in deep learning via gradient back-propagation

Numerical computation is dominant in deep learning (DL) programs. Consequently, numerical bugs are one of the most prominent kinds of defects in DL programs. Numerical bugs can lead to exceptional values such as NaN (Not-a-Number) and INF (Infinite), which can be propagated and eventually cause crashes or invalid outputs. They occur when special inputs cause invalid parameter values at internal mathematical operations such as log(). In this paper, we propose the first dynamic technique, called GRIST, which automatically generates a small input that can expose numerical bugs in DL programs. GRIST piggy-backs on the built-in gradient computation functionalities of DL infrastructures. Our evaluation on 63 real-world DL programs shows that GRIST detects 78 bugs including 56 unknown bugs. By submitting them to the corresponding issue repositories, eight bugs have been confirmed and three bugs have been fixed. Moreover, GRIST can save 8.79X execution time to expose numerical bugs compared to running original programs with its provided inputs. Compared to the state-of-the-art technique DEBAR (which is a static technique), DEBAR produces 12 false positives and misses 31 true bugs (of which 30 bugs can be found by GRIST), while GRIST only misses one known bug in those programs and no false positive. The results demonstrate the effectiveness of GRIST.

Metamorphic testing of Datalog engines

Datalog is a popular query language with applications in several domains. Like any complex piece of software, Datalog engines may contain bugs. The most critical ones manifest as incorrect results when evaluating queries—we refer to these as query bugs. Given the wide applicability of the language, query bugs may have detrimental consequences, for instance, by compromising the soundness of a program analysis that is implemented and formalized in Datalog. In this paper, we present the first metamorphic-testing approach for detecting query bugs in Datalog engines. We ran our tool on three mature engines and found 13 previously unknown query bugs, some of which are deep and revealed critical semantic issues.

SESSION: Human Computer Interaction

Synthesis of web layouts from examples

We present a new technique for synthesizing dynamic, constraint-based visual layouts from examples. Our technique tackles two major challenges of layout synthesis. First, realistic layouts, especially on the web, often contain hundreds of elements, so the synthesizer needs to scale to layouts of this complexity. Second, in common usage scenarios, examples contain noise, so the synthesizer needs to be tolerant to imprecise inputs. To address these challenges we propose a two-phase approach to synthesis, where a local inference phase rapidly generates a set of likely candidate constraints that satisfy the given examples, and then a global inference phase selects a subset of the candidates that generalizes to unseen inputs. This separation of concerns helps our technique tackle the two challenges: the local phase employs Bayesian inference to handle noisy inputs, while the global phase leverages the hierarchical nature of complex layouts to decompose the global inference problem into inference of independent sub-layouts.

We implemented this technique in a tool called Mockdown and evaluated it on nine real-world web layouts, as well as a series of widespread layout components and an existing dataset of 644 Android applications. Our experiments show that Mockdown is able to synthesize a highly accurate layout for the majority of benchmarks from just three examples (two for Android layouts), and that it scales to layouts with over 600 elements, about 30x more than has been reported in prior work on layout synthesis.

SESSION: Machine Learning for Software Engineering

Boosting coverage-based fault localization via graph-based representation learning

Coverage-based fault localization has been extensively studied in the literature due to its effectiveness and lightweightness for real-world systems. However, existing techniques often utilize coverage in an oversimplified way by abstracting detailed coverage into numbers of tests or boolean vectors, thus limiting their effectiveness in practice. In this work, we present a novel coverage-based fault localization technique, GRACE, which fully utilizes detailed coverage information with graph-based representation learning. Our intuition is that coverage can be regarded as connective relationships between tests and program entities, which can be inherently and integrally represented by a graph structure: with tests and program entities as nodes, while with coverage and code structures as edges. Therefore, we first propose a novel graph-based representation to reserve all detailed coverage information and fine-grained code structures into one graph. Then we leverage Gated Graph Neural Network to learn valuable features from the graph-based coverage representation and rank program entities in a listwise way. Our evaluation on the widely used benchmark Defects4J (V1.2.0) shows that GRACE significantly outperforms state-of-the-art coverage-based fault localization: GRACE localizes 195 bugs within Top-1 whereas the best compared technique can at most localize 166 bugs within Top-1. We further investigate the impact of each GRACE component and find that they all positively contribute to GRACE. In addition, our results also demonstrate that GRACE has learnt essential features from coverage, which are complementary to various information used in existing learning-based fault localization. Finally, we evaluate GRACE in the cross-project prediction scenario on extra 226 bugs from Defects4J (V2.0.0), and find that GRACE consistently outperforms state-of-the-art coverage-based techniques.

SynGuar: guaranteeing generalization in programming by example

Programming by Example (PBE) is a program synthesis paradigm in which the synthesizer creates a program that matches a set of given examples. In many applications of such synthesis (e.g., program repair or reverse engineering), we are to reconstruct a program that is close to a specific target program, not merely to produce some program that satisfies the seen examples. In such settings, we wish that the synthesized program generalizes well, i.e., has as few errors as possible on the unobserved examples capturing the target function behavior. In this paper, we propose the first framework (called SynGuar) for PBE synthesizers that guarantees to achieve low generalization error with high probability. Our main contribution is a procedure to dynamically calculate how many additional examples suffice to theoretically guarantee generalization. We show how our techniques can be used in 2 well-known synthesis approaches: PROSE and STUN (synthesis through unification), for common string-manipulation program benchmarks. We find that often a few hundred examples suffice to provably bound generalization error below 5% with high (≥ 98%) probability on these benchmarks. Further, we confirm this empirically: SynGuar significantly improves the accuracy of existing synthesizers in generating the right target programs. But with fewer examples chosen arbitrarily, the same baseline synthesizers (without SynGuar) overfit and lose accuracy.

StateFormer: fine-grained type recovery from binaries using generative state modeling

Binary type inference is a critical reverse engineering task supporting many security applications, including vulnerability analysis, binary hardening, forensics, and decompilation. It is a difficult task because source-level type information is often stripped during compilation, leaving only binaries with untyped memory and register accesses. Existing approaches rely on hand-coded type inference rules defined by domain experts, which are brittle and require nontrivial effort to maintain and update. Even though machine learning approaches have shown promise at automatically learning the inference rules, their accuracy is still low, especially for optimized binaries.

We present StateFormer, a new neural architecture that is adept at accurate and robust type inference. StateFormer follows a two-step transfer learning paradigm. In the pretraining step, the model is trained with Generative State Modeling (GSM), a novel task that we design to teach the model to statically approximate execution effects of assembly instructions in both forward and backward directions. In the finetuning step, the pretrained model learns to use its knowledge of operational semantics to infer types.

We evaluate StateFormer's performance on a corpus of 33 popular open-source software projects containing over 1.67 billion variables of different types. The programs are compiled with GCC and LLVM over 4 optimization levels O0-O3, and 3 obfuscation passes based on LLVM. Our model significantly outperforms state-of-the-art ML-based tools by 14.6% in recovering types for both function arguments and variables. Our ablation studies show that GSM improves type inference accuracy by 33%.

Empirical study of transformers for source code

Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i.e., it follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.

Explaining mispredictions of machine learning models using rule induction

While machine learning (ML) models play an increasingly prevalent role in many software engineering tasks, their prediction accuracy is often problematic. When these models do mispredict, it can be very difficult to isolate the cause. In this paper, we propose a technique that aims to facilitate the debugging process of trained statistical models. Given an ML model and a labeled data set, our method produces an interpretable characterization of the data on which the model performs particularly poorly. The output of our technique can be useful for understanding limitations of the training data or the model itself; it can also be useful for ensembling if there are multiple models with different strengths. We evaluate our approach through case studies and illustrate how it can be used to improve the accuracy of predictive models used for software engineering tasks within Facebook.

Generalizable and interpretable learning for configuration extrapolation

Modern software applications are increasingly configurable, which puts a burden on users to tune these configurations for their target hardware and workloads. To help users, machine learning techniques can model the complex relationships between software configuration parameters and performance. While powerful, these learners have two major drawbacks: (1) they rarely incorporate prior knowledge and (2) they produce outputs that are not interpretable by users. These limitations make it difficult to (1) leverage information a user has already collected (e.g., tuning for new hardware using the best configurations from old hardware) and (2) gain insights into the learner’s behavior (e.g., understanding why the learner chose different configurations on different hardware or for different workloads). To address these issues, this paper presents two configuration optimization tools, GIL and GIL+, using the proposed generalizable and interpretable learning approaches. To incorporate prior knowledge, the proposed tools (1) start from known configurations, (2) iteratively construct a new linear model, (3) extrapolate better performance configurations from that model, and (4) repeat. Since the base learners are linear models, these tools are inherently interpretable. We enhance this property with a graphical representation of how they arrived at the highest performance configuration. We evaluate GIL and GIL+ by using them to configure Apache Spark workloads on different hardware platforms and find that, compared to prior work, GIL and GIL+ produce comparable, and sometimes even better performance configurations, but with interpretable results.

SESSION: Program Comprehension

Lightweight global and local contexts guided method name recommendation with prior knowledge

The quality of method names is critical for the readability and maintainability of source code. However, it is often challenging to construct concise method names. To alleviate this problem, a number of approaches have been proposed to automatically recommend high-quality names for methods. Despite being effective, existing approaches meet their bottlenecks mainly in two aspects: (1) the leveraged information is restricted to the target method itself; and (2) lack of distinctions towards the contributions of tokens extracted from different program contexts. Through a large-scale empirical analysis on +12M methods from +14K real-world projects, we found that (1) the tokens composing a method’s name can be frequently observed in its callers/callees; and (2) tokens extracted from different specific contexts have diverse probabilities to compose the target method’s name. Motivated by our findings, we propose, in this paper, a context-guided method name recommender, which mainly embodies two key ideas: (1) apart from the local context, which is extracted from the target method itself, we also consider the global context, which is extracted from other methods in the project that have call relations with the target method, to include more useful information; and (2) we utilize our empirical results as the prior knowledge to guide the generation of method names and also to restrict the number of tokens extracted from the global contexts. We implemented the idea as Cognac and performed extensive experiments to assess its effectiveness. Results reveal that can (1) perform better than existing approaches on the method name recommendation task (e.g., it achieves an F-score of 63.2%, 60.8%, 66.3%, and 68.5%, respectively, on four widely-used datasets, which all outperform existing techniques); and (2) achieve higher performance than existing techniques on the method name consistency checking task (e.g., its overall accuracy reaches 76.6%, outperforming the state-of-the-art MNire by 11.2%). Further results reveal that the caller/callee information and the prior knowledge all contribute significantly to the overall performance of Cognac.

To read or to rotate? comparing the effects of technical reading training and spatial skills training on novice programming ability

Understanding how to best support and train novice programmers is a critical component of producing better and more diverse software engineers. In this paper, we present the results of a controlled 11-week longitudinal study with 57 CS1 students comparing two skill-based interventions to improve programming performance. The first intervention involves spatial training, an established baseline known to be helpful in engineering contexts. The second intervention is a novel CS-focused technical reading training.

In our reading training, we teach strategies for summarizing scientific papers and understanding scientific charts and figures; most of the covered readings were CS1-accessible portions of computer science research papers. For the spatial training, we use a standardized training curriculum previously found to improve programming skills by focusing on spatial ability (i.e., the ability to mentally manipulate objects). We first replicate findings that both reading ability and spatial ability correlate with programming success. Significantly, however, we find that those in our reading training exhibit larger programming ability gains than those in the standard spatial training (p = 0.02, f2=0.10). We also find that reading trained participants perform particularly well on programming problems that require tracing through code (p = 0.03, f2=0.10). Our results suggest that technical reading training could be beneficial for novice programmers. Finally, we discuss the implications of our results for future CS1 interventions, the possibility for non-programming based training to positively impact developers, and future directions for software engineering education research.

Connecting the dots: rethinking the relationship between code and prose writing with functional connectivity

Medical imaging studies of software engineering have risen in popularity and may reveal the neural underpinnings of coding activities. To date, however, all studies in computer science venues have treated brain regions independently and in isolation. Since most complex neural activity involves coordination among multiple regions, previous analyses may overlook neural behavior.

We propose to apply functional connectivity analysis to medical imaging data from software engineering tasks. Informally, this analysis treats the brain as a graph, rather than a series of independent modules, and statistically infers relevant edges. We present a functional connectivity analysis of existing data, which elucidates the interconnections between code writing and prose writing, especially regarding higher mathematics and semantic processing. First, we found a significant link between Broca’s Area (language) and the Number Form Area (higher mathematics) for coding. This both refines previous interpretations that code writing and natural language are distinct from each other, and may also contribute to the understanding of the Number Form Area in the Psychology literature. Second, we identify an area with important functional connectivity for both prose writing and coding, unlike previous analyses that associated it with coding. This advances our neural understanding of coding and prose writing, and was only exposed by using functional connectivity analysis. Third, for coding, we find a strong functional connectivity result for a brain region involved in semantic processing for language, with implications for CS training. Finally, we find a neural relationship between coding and expertise, including a more grounded explanation than prior work.

SESSION: Software Security

LastPyMile: identifying the discrepancy between sources and packages

Open source packages have source code available on repositories for inspection (e.g. on GitHub) but developers use pre-built packages directly from the package repositories (such as npm for JavaScript, PyPI for Python, or RubyGems for Ruby). Such convenient practice assumes that there are no discrepancies between source code and packages. These differences pose both operational risks (e.g. making dependent projects unable to compile) and security risks (e.g. deploying malicious code during package installation) in the software supply chain. Our empirical assessment of 2438 popular packages in PyPI with an analysis of around 10M lines of code shows several differences in the wild: modifications cannot be just attributed to malicious injections. Yet, scanning again all and whole ‘most likely good but modified’ packages is hard to manage for FOSS downstream users. We propose a methodology, LastPyMile, for identifying the differences between build artifacts of software packages and the respective source code repository. We show how it can be used to extend current package scanning practices for malware injection (which only covers less than 1% of the code of deployed packages).

A grounded theory of the role of coordination in software security patch management

Several disastrous security attacks can be attributed to delays in patching software vulnerabilities. While researchers and practitioners have paid significant attention to automate vulnerabilities identification and patch development activities of software security patch management, there has been relatively little effort dedicated to gain an in-depth understanding of the socio-technical aspects, e.g., coordination of interdependent activities of the patching process and patching decisions, that may cause delays in applying security patches. We report on a Grounded Theory study of the role of coordination in security patch management. The reported theory consists of four inter-related dimensions, i.e., causes, breakdowns, constraints, and mechanisms. The theory explains the causes that define the need for coordination among interdependent software/hardware components and multiple stakeholders’ decisions, the constraints that can negatively impact coordination, the breakdowns in coordination, and the potential corrective measures. This study provides potentially useful insights for researchers and practitioners who can carefully consider the needs of and devise suitable solutions for supporting the coordination of interdependencies involved in security patch management.

TaintStream: fine-grained taint tracking for big data platforms through dynamic code translation

Big data has become valuable property for enterprises and enabled various intelligent applications. Today, it is common to host data in big data platforms (e.g., Spark), where developers can submit scripts to process the original and intermediate data tables. Meanwhile, it is highly desirable to manage the data to comply with various privacy requirements. To enable flexible and automated privacy policy enforcement, we propose TaintStream, a fine-grained taint tracking framework for Spark-like big data platforms. TaintStream works by automatically injecting taint tracking logic into the data processing scripts, and the injected scripts are dynamically translated to maintain a taint tag for each cell during execution. The dynamic translation rules are carefully designed to guarantee non-interference in the original data operation. By defining different semantics of taint tags, TaintStream can enable various data management applications such as access control, data retention, and user data erasure. Our experiments on a self-crafted benchmarksuite show that TaintStream is able to achieve accurate cell-level taint tracking with a precision of 93.0% and less than 15% overhead. We also demonstrate the usefulness of TaintStream through several real-world use cases of privacy policy enforcement.

SESSION: Debugging

Demystifying “bad” error messages in data science libraries

Error messages are critical starting points for debugging. Unfortunately, they seem to be notoriously cryptic, confusing, and uninformative. Yet, it still remains a mystery why error messages receive such bad reputations, especially given that they are merely very short pieces of natural language text. In this paper, we empirically demystify the causes and fixes of "bad" error messages, by qualitatively studying 201 Stack Overflow threads and 335 GitHub issues. We specifically focus on error messages encountered in data science development, which is an increasingly important but not well studied domain. We found that the causes of "bad" error messages are far more complicated than poor phrasing or flawed articulation of error message content. Many error messages are inherently and inevitably misleading or uninformative, since libraries do not know user intentions and cannot "see" external errors. Fixes to error-message-related issues mostly involve source code changes, while exclusive message content updates only take up a small portion. In addition, whether an error message is informative or helpful is not always clear-cut; even error messages that clearly pinpoint faults and resolutions can still cause confusion for certain users. These findings thus call for a more in-depth investigation on how error messages should be evaluated and improved in the future.

NIL: large-scale detection of large-variance clones

A code clone (in short, clone) is a code fragment that is identical or similar to other code fragments in source code. Clones generated by a large number of changes to copy-and-pasted code fragments are called large-variance (modifications are scattered) or large-gap (modifications are in one place) clones. It is difficult for general clone detection techniques to detect such clones and thus specialized techniques are necessary. In addition, with the rapid growth of software development, scalable clone detectors that can detect clones in large codebases are required. However, there are no existing techniques for quickly detecting large-variance or large-gap clones in large codebases. In this paper, we propose a scalable clone detection technique that can detect large-variance clones from large codebases and describe its implementation, called NIL. NIL is a token-based clone detector that efficiently identifies clone candidates using an N-gram representation of token sequences and an inverted index. Then, NIL verifies the clone candidates by measuring their similarity based on the longest common subsequence between their token sequences. We evaluate NIL in terms of large- variance clone detection accuracy, general Type-1, Type-2, and Type- 3 clone detection accuracy, and scalability. Our experimental results show that NIL has higher accuracy in terms of large-variance clone detection, equivalent accuracy in terms of general clone detection, and the shortest execution time for inputs of various sizes (1–250 MLOC) compared to existing state-of-the-art tools.

Understanding and detecting server-side request races in web applications

Modern web sites often run web applications on the server to handle HTTP requests from users and generate dynamic responses. Due to their concurrent nature, web applications are vulnerable to server-side request races. The problem becomes more severe with the ever-increasing popularity of web applications.

We first conduct a comprehensive characteristic study of 157 real-world server-side request races collected from different, popular types of web applications. The findings of this study can provide guidance for future development support in combating server-side request races.

Guided by our study results, we develop a dynamic framework, ReqRacer, for detecting and exposing server-side request races in web applications. We propose novel approaches to model happens-before relationships between HTTP requests, which are essential to web applications. Our evaluation shows that ReqRacer can effectively and efficiently detect known and unknown request races.

Detecting and localizing keyboard accessibility failures in web applications

The keyboard is the most universally supported input method operable by people with disabilities. Yet, many popular websites lack keyboard accessible mechanism, which could cause failures that make the website unusable. In this paper, we present a novel approach for automatically detecting and localizing keyboard accessibility failures in web applications. Our extensive evaluation of our technique on real world web pages showed that our technique was able to detect keyboard failures in web applications with high precision and recall and was able to accurately identify the underlying elements in the web pages that led to the observed problems.

Swarmbug: debugging configuration bugs in swarm robotics

Swarm robotics collectively solve problems that are challenging for individual robots, from environmental monitoring to entertainment. The algorithms enabling swarms allow individual robots of the swarm to plan, share, and coordinate their trajectories and tasks to achieve a common goal. Such algorithms rely on a large number of configurable parameters that can be tailored to target particular scenarios. This large configuration space, the complexity of the algorithms, and the dependencies with the robots’ setup and performance make debugging and fixing swarms configuration bugs extremely challenging. This paper proposes Swarmbug, a swarm debugging system that automatically diagnoses and fixes buggy behaviors caused by misconfiguration. The essence of Swarmbug is the novel concept called the degree of causal contribution (Dcc), which abstracts impacts of environment configurations (e.g., obstacles) to the drones in a swarm via behavior causal analysis. Swarmbug automatically generates, validates, and ranks fixes for configuration bugs. We evaluate Swarmbug on four diverse swarm algorithms. Swarmbug successfully fixes four configuration bugs in the evaluated algorithms, showing that it is generic and effective. We also conduct a real-world experiment with physical drones to show the Swarmbug’s fix is effective in the real-world.

Probabilistic Delta debugging

The delta debugging problem concerns how to reduce an object while preserving a certain property, and widely exists in many applications, such as compiler development, regression fault localization, and software debloating. Given the importance of delta debugging, multiple algorithms have been proposed to solve the delta debugging problem efficiently and effectively. However, the efficiency and effectiveness of the state-of-the-art algorithms are still not satisfactory. For example, the state-of-the-art delta debugging tool, CHISEL, may take up to 3 hours to reduce a single program with 14,092 lines of code, while the reduced program may be up to 2 times unnecessarily large.

In this paper, we propose a probabilistic delta debugging algorithm (named ProbDD) to improve the efficiency and the effectiveness of delta debugging. Our key insight is, the ddmin algorithm, the basic algorithm upon which many existing approaches are built, follows a predefined sequence of attempts to remove elements from a sequence, and fails to utilize the information from existing test results. To address this problem, ProbDD builds a probabilistic model to estimate the probabilities of the elements to be kept in the produced result, selects a set of elements to maximize the gain of the next test based on the model, and improves the model based on the test results.

We prove the correctness of ProbDD, and analyze the minimality of its result and the asymptotic number of tests under the worst case. The asymptotic number of tests in the worst case of ProbDD is O(n), which is smaller than that of ddmin, O(n2) worst-case asymptotic number of tests. Furthermore, we experimentally compared ProbDD with ddmin on 40 subjects in HDD and CHISEL, two approaches that wrap ddmin for reducing trees and C programs, respectively. The results show that, after replacing ddmin with ProbDD, HDD and CHISEL produce 59.48% and 11.51% smaller results and use 63.22% and 45.27% less time, respectively.

SESSION: Bug Characterization and Fixing

Finding broken Linux configuration specifications by statically analyzing the Kconfig language

Highly-configurable software underpins much of our computing infrastructure. It enables extensive reuse, but opens the door to broken configuration specifications. The configuration specification language, Kconfig, is designed to prevent invalid configurations of the Linux kernel from being built. However, the astronomical size of the configuration space for Linux makes finding specification bugs difficult by hand or with random testing. In this paper, we introduce a software model checking framework for building Kconfig static analysis tools. We develop a formal semantics of the Kconfig language and implement the semantics in a symbolic evaluator called kclause that models Kconfig behavior as logical formulas. We then design and implement a bug finder, called kismet, that takes kclause models and leverages automated theorem proving to find unmet dependency bugs. kismet is evaluated for its precision, performance, and impact on kernel development for a recent version of Linux, which has over 140,000 lines of Kconfig across 28 architecture-specific specifications. Our evaluation finds 781 bugs (151 when considering sharing among Kconfig specifications) with 100% precision, spending between 37 and 90 minutes for each Kconfig specification, although it misses some bugs due to underapproximation. Compared to random testing, kismet finds substantially more true positive bugs in a fraction of the time.

Semantic bug seeding: a learning-based approach for creating realistic bugs

When working on techniques to address the wide-spread problem of software bugs, one often faces the need for a large number of realistic bugs in real-world programs. Such bugs can either help evaluate an approach, e.g., in form of a bug benchmark or a suite of program mutations, or even help build the technique, e.g., in learning-based bug detection. Because gathering a large number of real bugs is difficult, a common approach is to rely on automatically seeded bugs. Prior work seeds bugs based on syntactic transformation patterns, which often results in unrealistic bugs and typically cannot introduce new, application-specific code tokens.

This paper presents SemSeed, a technique for automatically seeding bugs in a semantics-aware way. The key idea is to imitate how a given real-world bug would look like in other programs by semantically adapting the bug pattern to the local context. To reason about the semantics of pieces of code, our approach builds on learned token embeddings that encode the semantic similarities of identifiers and literals. Our evaluation with real-world JavaScript software shows that the approach effectively reproduces real bugs and clearly outperforms a semantics-unaware approach. The seeded bugs are useful as training data for learning-based bug detection, where they significantly improve the bug detection ability. Moreover, we show that SemSeed-created bugs complement existing mutation testing operators, and that our approach is efficient enough to seed hundreds of thousands of bugs within an hour.

SESSION: Mining Software Repositories

Characterizing search activities on stack overflow

To solve programming issues, developers commonly search on Stack Overflow to seek potential solutions. However, there is a gap between the knowledge developers are interested in and the knowledge they are able to retrieve using search engines. To help developers efficiently retrieve relevant knowledge on Stack Overflow, prior studies proposed several techniques to reformulate queries and generate summarized answers. However, few studies performed a large-scale analysis using real-world search logs. In this paper, we characterize how developers search on Stack Overflow using such logs. By doing so, we identify the challenges developers face when searching on Stack Overflow and seek opportunities for the platform and researchers to help developers efficiently retrieve knowledge. To characterize search activities on Stack Overflow, we use search log data based on requests to Stack Overflow's web servers. We find that the most common search activity is reformulating the immediately preceding queries. Related work looked into query reformulations when using generic search engines and found 13 types of query reformulation strategies. Compared to their results, we observe that 71.78% of the reformulations can be fitted into those reformulation strategies. In terms of how queries are structured, 17.41% of the search sessions only search for fragments of source code artifacts (e.g., class and method names) without specifying the names of programming languages, libraries, or frameworks. Based on our findings, we provide actionable suggestions for Stack Overflow moderators and outline directions for future research. For example, we encourage Stack Overflow to set up a database that includes the relations between all computer programming terminologies shared on Stack Overflow, e.g., method name, data structure name, design pattern, and IDE name. By doing so, Stack Overflow could improve the performance of search engines by considering related programming terminologies at different levels of granularity.

Authorship attribution of source code: a language-agnostic approach and applicability in software engineering

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.

SESSION: Software Engineering for Machine Learning

Probing model signal-awareness via prediction-preserving input minimization

This work explores the signal awareness of AI models for source code understanding. Using a software vulnerability detection use case, we evaluate the models' ability to capture the correct vulnerability signals to produce their predictions. Our prediction-preserving input minimization (P2IM) approach systematically reduces the original source code to a minimal snippet which a model needs to maintain its prediction. The model's reliance on incorrect signals is then uncovered when the vulnerability in the original code is missing in the minimal snippet, both of which the model however predicts as being vulnerable. We measure the signal awareness of models using a new metric we propose -- Signal-aware Recall (SAR). We apply P2IM on three different neural network architectures across multiple datasets. The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric, highlighting that the models are presumably picking up a lot of noise or dataset nuances while learning their vulnerability detection logic. Although the drop in model performance may be perceived as an adversarial attack, but this isn't P2IM's objective. The idea is rather to uncover the signal-awareness of a black-box model in a data-driven manner via controlled queries. SAR's purpose is to measure the impact of task-agnostic model training, and not to suggest a shortcoming in the Recall metric. The expectation, in fact, is for SAR to match Recall in the ideal scenario where the model truly captures task-specific signals.

Generating efficient solvers from constraint models

Combinatorial problems (CPs) arise in many areas, and people use constraint solvers to automatically solve these problems. However, the state-of-the-art constraint solvers (e.g., Gecode and Chuffed) have overly complicated software architectures; they compute solutions inefficiently. This paper presents a novel and model-driven approach---SoGen---to synthesize efficient problem-specific solvers from constraint models. Namely, when users model a CP with our domain-specific language PDL (short for Problem Description Language), SoGen automatically analyzes various properties of the problem (e.g., search space, value boundaries, function monotonicity, and overlapping subproblems), synthesizes an efficient solver algorithm based on those properties, and generates a C program as the problem solver. PDL is unique because it can create solvers that resolve constraints via dynamic programming (DP) search.

For evaluation, we compared the solvers generated by SoGen with two state-of-the-art constraint solvers: Gecode and Chuffed. PDL's solvers resolved constraints more efficiently; they achieved up to 6,058x speedup over Gecode and up to 31,300x speedup over Chuffed. Additionally, we experimented with both SoGen and the state-of-the-art solver generator---Dominion. We found SoGen to generate solvers faster and the produced solvers are more efficient.

A comprehensive study of deep learning compiler bugs

There are increasing uses of deep learning (DL) compilers to generate optimized code, boosting the runtime performance of DL models on specific hardware. Like their traditional counterparts, DL compilers can generate incorrect code, resulting in unexpected model behaviors that may cause catastrophic consequences in mission-critical systems. On the other hand, the DL models processed by DL compilers differ fundamentally from imperative programs in that the program logic in DL models is implicit. As such, various characteristics of the bugs arising from traditional compilers need to be revisited in the context of DL compilers.

In this paper, we present the first systematic study of DL compiler bugs by analyzing 603 bugs arising in three popular DL compilers (i.e., TVM from Apache, Glow from Facebook, and nGraph from Intel). We analyzed these bugs according to their root causes, symptoms, and the stages where they occur during compilation. We obtain 12 findings, and provide a series of valuable guidelines for future work on DL compiler bug detection and debugging. For example, a large portion (nearly 20%) of DL compiler bugs are related to types, especially tensor types. The analysis of these bugs helps design new mutation operators (e.g., adding type cast for a tensor to promote implicit type conversion in subsequent tensor computations) to facilitate type-related bug detection. Further, we developed TVMfuzz as a proof-of-concept application of our findings to test the TVM DL compiler. It generates new tests based on TVM's original test suite. They expose 8 TVM bugs that are missed by the original test suite. The result demonstrates the usefulness of our findings.

Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline

In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. However, most of the research on fairness has considered a single classifier based prediction task. What are the fairness impacts of the preprocessing stages in machine learning pipeline? Furthermore, studies showed that often the root cause of unfairness is ingrained in the data itself, rather than the model. But no research has been conducted to measure the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in ML pipeline. We leveraged existing metrics to define the fairness measures of the stages. Then we conducted a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers are causing the model to exhibit unfairness. We identified a number of fairness patterns in several categories of data transformers. Finally, we showed how the local fairness of a preprocessing stage composes in the global fairness of the pipeline. We used the fairness composition to choose appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.

Fairea: a model behaviour mutation approach to benchmarking bias mitigation methods

The increasingly wide uptake of Machine Learning (ML) has raised the significance of the problem of tackling bias (i.e., unfairness), making it a primary software engineering concern. In this paper, we introduce Fairea, a model behaviour mutation approach to benchmarking ML bias mitigation methods. We also report on a large-scale empirical study to test the effectiveness of 12 widely-studied bias mitigation methods. Our results reveal that, surprisingly, bias mitigation methods have a poor effectiveness in 49% of the cases. In particular, 15% of the mitigation cases have worse fairness-accuracy trade-offs than the baseline established by Fairea; 34% of the cases have a decrease in accuracy and an increase in bias.

Fairea has been made publicly available for software engineers and researchers to evaluate their bias mitigation methods.

SESSION: Software Evolution

Feature trace recording

Tracing requirements to their implementation is crucial to all stakeholders of a software development process. When managing software variability, requirements are typically expressed in terms of features, a feature being a user-visible characteristic of the software. While feature traces are fully documented in software product lines, ad-hoc branching and forking, known as clone-and-own, is still the dominant way for developing multi-variant software systems in practice. Retroactive migration to product lines suffers from uncertainties and high effort because knowledge of feature traces must be recovered but is scattered across teams or even lost. We propose a semi-automated methodology for recording feature traces proactively, during software development when the necessary knowledge is present. To support the ongoing development of previously unmanaged clone-and-own projects, we explicitly deal with the absence of domain knowledge for both existing and new source code. We evaluate feature trace recording by replaying code edit patterns from the history of two real-world product lines. Our results show that feature trace recording reduces the manual effort to specify traces. Recorded feature traces could improve automation in change-propagation among cloned system variants and could reduce effort if developers decide to migrate to a product line.

A longitudinal analysis of bloated Java dependencies

We study the evolution and impact of bloated dependencies in a single software ecosystem: Java/Maven. Bloated dependencies are third-party libraries that are packaged in the application binary but are not needed to run the application. We analyze the history of 435 Java projects. This historical data includes 48,469 distinct dependencies, which we study across a total of 31,515 versions of Maven dependency trees. Bloated dependencies steadily increase over time, and 89.2% of the direct dependencies that are bloated remain bloated in all subsequent versions of the studied projects. This empirical evidence suggests that developers can safely remove a bloated dependency. We further report novel insights regarding the unnecessary maintenance efforts induced by bloat. We find that 22% of dependency updates performed by developers are made on bloated dependencies, and that Dependabot suggests a similar ratio of updates on bloated dependencies.

SESSION: Software Practices

XAI tools in the public sector: a case study on predicting combined sewer overflows

Artificial intelligence and deep learning are becoming increasingly prevalent in contemporary software solutions. Explainable artificial intelligence (XAI) tools attempt to address the black box nature of the deep learning models and make them more understandable to humans. In this work, we apply three state-of-the-art XAI tools in a real-world case study. Our study focuses on predicting combined sewer overflow events for a municipal wastewater treatment organization. Through a data driven inquiry, we collect both qualitative information via stakeholder interviews and quantitative measures. These help us assess the predictive accuracy of the XAI tools, as well as the simplicity, soundness, and insightfulness of the produced explanations. Our results not only show the varying degrees that the XAI tools meet the requirements, but also highlight that domain experts can draw new insights from complex explanations that may differ from their previous expectations.

How disabled tests manifest in test maintainability challenges?

Software testing is an essential software quality assurance practice. Testing helps expose faults earlier, allowing developers to repair the code and reduce future maintenance costs. However, repairing (i.e., making failing tests pass) may not always be done immediately. Bugs may require multiple rounds of repairs and even remain unfixed due to the difficulty of bug-fixing tasks. To help test maintenance, along with code comments, the majority of testing frameworks (e.g., JUnit and TestNG) have also introduced annotations such as @Ignore to disable failing tests temporarily. Although disabling tests may help alleviate maintenance difficulties, they may also introduce technical debt. With the faster release of applications in modern software development, disabling tests may become the salvation for many developers to meet project deliverables. In the end, disabled tests may become outdated and a source of technical debt, harming long-term maintenance. Despite its harmful implications, there is little empirical research evidence on the prevalence, evolution, and maintenance of disabling tests in practice. To fill this gap, we perform the first empirical study on test disabling practice. We develop a tool to mine 122K commits and detect 3,111 changes that disable tests from 15 open-source Java systems. Our main findings are: (1) Test disabling changes are 19% more common than regular test refactorings, such as renames and type changes. (2) Our life-cycle analysis shows that 41% of disabled tests are never brought back to evaluate software quality, and most disabled tests stay disabled for several years. (3)We unveil the motivations behind test disabling practice and the associated technical debt by manually studying evolutions of 349 unique disabled tests, achieving a 95% confidence level and a 5% confidence interval. Finally, we present some actionable implications for researchers and developers.

SESSION: Software Processes

Sustainability forecasting for Apache incubator projects

Although OSS development is very popular, ultimately more than 80% of OSS projects fail. Identifying the factors associated with OSS success can help in devising interventions when a project takes a downturn. OSS success has been studied from a variety of angles, more recently in empirical studies of large numbers of diverse projects, using proxies for sustainability, e.g., internal metrics related to productivity and external ones, related to community popularity. The internal socio-technical structure of projects has also been shown important, especially their dynamics. This points to another angle on evaluating software success, from the perspective of self-sustaining and self-governing communities.

To uncover the dynamics of how a project at a nascent development stage gradually evolves into a sustainable one, here we apply a socio-technical network modeling perspective to a dataset of Apache Software Foundation Incubator (ASFI), sustainability-labeled projects. To identify and validate the determinants of sustainability, we undertake a mix of quantitative and qualitative studies of ASFI projects’ socio-technical network trajectories. We develop interpretable models which can forecast a project becoming sustainable with 93+% accuracy, within 8 months of incubation start. Based on the interpretable models we describe a strategy for real-time monitoring and suggesting actions, which can be used by projects to correct their sustainability trajectories.

SESSION: Test Generation

Graph-based seed object synthesis for search-based unit testing

Search-based software testing (SBST) generates tests using search algorithms guided by measurements gauging how far a test case is away from exercising a coverage goal. The effectiveness of SBST largely depends on the continuity and monotonicity of the fitness landscape decided by these measurements and the search operators. Unfortunately, the fitness landscape is challenging when the function under test takes object inputs, as classical measurement hardly provide guidance for constructing legitimate object inputs. To overcome this problem, we propose test seeds, i.e., test code skeletons of legitimate objects which enable the use of classical measurements. Given a target branch in a function under test, we first statically analyze the function to build an object construction graph that captures the relation between the operands of the target method and the states of their relevant object inputs. Based on the graph, we synthesize test template code where each "slot" is a mutation point for the search algorithm. This approach can be seamlessly integrated with existing SBST algorithms, and we implemented EvoObj on top of EvoSuite. Our experiments show that EvoObj outperforms EvoSuite with statistical significance on 2750 methods over 103 open source Java projects using state-of-the-art SBST algorithms.

LS-sampling: an effective local search based sampling approach for achieving high t-wise coverage

There has been a rapidly increasing demand for developing highly configurable software systems, which urgently calls for effective testing methods. In practice, t-wise coverage has been widely recognized as a useful metric to evaluate the quality of a test suite for testing highly configurable software systems, and achieving high t-wise coverage is important for ensuring test adequacy. However, state-of-the-art methods usually cost a fairly long time to generate large test suites for high pairwise coverage (i.e., 2-wise coverage), which would lead to ineffective and inefficient testing of highly configurable software systems. In this paper, we propose a novel local search based sampling approach dubbed LS-Sampling for achieving high t-wise coverage. Extensive experiments on a large number of public benchmarks, which are collected from real-world, highly configurable software systems, show that LS-Sampling achieves higher 2-wise and 3-wise coverage than the current state of the art. LS-Sampling is effective, since on average it achieves the 2-wise coverage of 99.64% and the 3-wise coverage of 97.87% through generating a small test suite consisting of only 100 test cases (90% smaller than the test suites generated by its state-of-the-art competitors). Furthermore, LS-Sampling is efficient, since it only requires an average execution time of less than one minute to generate a test suite with high 2-wise and 3-wise coverage.

GLIB: towards automated test oracle for graphically-rich applications

Graphically-rich applications such as games are ubiquitous with attractive visual effects of Graphical User Interface (GUI) that offers a bridge between software applications and end-users. However, various types of graphical glitches may arise from such GUI complexity and have become one of the main component of software compatibility issues. Our study on bug reports from game development teams in NetEase Inc. indicates that graphical glitches frequently occur during the GUI rendering and severely degrade the quality of graphically-rich applications such as video games. Existing automated testing techniques for such applications focus mainly on generating various GUI test sequences and check whether the test sequences can cause crashes. These techniques require constant human attention to captures non-crashing bugs such as bugs causing graphical glitches. In this paper, we present the first step in automating the test oracle for detecting non-crashing bugs in graphically-rich applications. Specifically, we propose GLIB based on a code-based data augmentation technique to detect game GUI glitches. We perform an evaluation of GLIB on 20 real-world game apps (with bug reports available) and the result shows that GLIB can achieve 100% precision and 99.5% recall in detecting non-crashing bugs such as game GUI glitches. Practical application of GLIB on another 14 real-world games (without bug reports) further demonstrates that GLIB can effectively uncover GUI glitches, with 48 of 53 bugs reported by GLIB having been confirmed and fixed so far.

SESSION: Metrics

Reassessing automatic evaluation metrics for code summarization tasks

In recent years, research in the domain of source code summarization has adopted data-driven techniques pioneered in machine translation (MT). Automatic evaluation metrics such as BLEU, METEOR, and ROUGE, are fundamental to the evaluation of MT systems and have been adopted as proxies of human evaluation in the code summarization domain. However, the extent to which automatic metrics agree with the gold standard of human evaluation has not been evaluated on code summarization tasks. Despite this, marginal improvements in metric scores are often used to discriminate between the performance of competing summarization models. In this paper, we present a critical exploration of the applicability and interpretation of automatic metrics as evaluation techniques for code summarization tasks. We conduct an empirical study with 226 human annotators to assess the degree to which automatic metrics reflect human evaluation. Results indicate that metric improvements of less than 2 points do not guarantee systematic improvements in summarization quality, and are unreliable as proxies of human evaluation. When the difference between metric scores for two summarization approaches increases but remains within 5 points, some metrics such as METEOR and chrF become highly reliable proxies, whereas others, such as corpus BLEU, remain unreliable. Based on these findings, we make several recommendations for the use of automatic metrics to discriminate model performance in code summarization.

SESSION: Programming Languages

Toward efficient interactions between Python and native libraries

Python has become a popular programming language because of its excellent programmability. Many modern software packages utilize Python for high-level algorithm design and depend on native libraries written in C/C++/Fortran for efficient computation kernels. Interaction between Python code and native libraries introduces performance losses because of the abstraction lying on the boundary of Python and native libraries. On the one side, Python code, typically run with interpretation, is disjoint from its execution behavior. On the other side, native libraries do not include program semantics to understand algorithm defects.

To understand the interaction inefficiencies, we extensively study a large collection of Python software packages and categorize them according to the root causes of inefficiencies. We extract two inefficiency patterns that are common in interaction inefficiencies. Based on these patterns, we develop PieProf, a lightweight profiler, to pinpoint interaction inefficiencies in Python applications. The principle of PieProf is to measure the inefficiencies in the native execution and associate inefficiencies with high-level Python code to provide a holistic view. Guided by PieProf, we optimize 17 real-world applications, yielding speedups up to 6.3× on application level.

Accelerating JavaScript static analysis via dynamic shortcuts

JavaScript has become one of the most widely used programming languages for web development, server-side programming, and even micro-controllers for IoT. However, its extremely functional and dynamic features degrade the performance and precision of static analysis. Moreover, the variety of built-in functions and host environments requires excessive manual modeling of their behaviors. To alleviate these problems, researchers have proposed various ways to leverage dynamic analysis during JavaScript static analysis. However, they do not fully utilize the high performance of dynamic analysis and often sacrifice the soundness of static analysis. In this paper, we present dynamic shortcuts, a new technique to flexibly switch between abstract and concrete execution during JavaScript static analysis in a sound way. It can significantly improve the analysis performance and precision by using highly-optimized commercial JavaScript engines and lessen the modeling efforts for opaque code. We actualize the technique via SAFEDS, an extended combination of SAFE and Jalangi, a static analyzer and a dynamic analyzer, respectively. We evaluated SAFEDS using 269 official tests of Lodash 4 library. Our experiment shows that SAFEDS is 7.81x faster than the baseline static analyzer, and it improves the precision to reduce failed assertions by 12.31% on average for 22 opaque functions.

SESSION: Approximations in Program Analysis and Testing

Skeletal approximation enumeration for SMT solver testing

Ensuring the equality of SMT solvers is critical due to its broad spectrum of applications in academia and industry, such as symbolic execution and program verification. Existing approaches to testing SMT solvers are either too costly or find difficulties generalizing to different solvers and theories, due to the test oracle problem. To complement existing approaches and overcome their weaknesses, this paper introduces skeletal approximation enumeration (SAE), a novel lightweight and general testing technique for all first-order theories. To demonstrate its practical utility, we have applied the SAE technique to test Z3 and CVC4, two comprehensively tested, state-of-the-art SMT solvers. By the time of writing, our approach had found 71 confirmed bugs in Z3 and CVC4,55 of which had already been fixed.

Boosting static analysis accuracy with instrumented test executions

The two broad approaches to discover properties of programs---static and dynamic analyses---have complementary strengths: static techniques perform exhaustive exploration and prove upper bounds on program behaviors, while the dynamic analysis of test cases provides concrete evidence of these behaviors and promise low false alarm rates. In this paper, we present DynaBoost, a system which uses information obtained from test executions to prioritize the alarms of a static analyzer. We instrument the program to dynamically look for dataflow behaviors predicted by the static analyzer, and use these results to bootstrap a probabilistic alarm ranking system, where the user repeatedly inspects the alarm judged most likely to be a real bug, and where the system re-ranks the remaining alarms in response to user feedback. The combined system is able to exploit information that cannot be easily provided by users, and provides significant improvements in the human alarm inspection burden: by 35% compared to the baseline ranking system, and by 89% compared to an unaided programmer triaging alarm reports.

Symbolic parallel adaptive importance sampling for probabilistic program analysis

Probabilistic software analysis aims at quantifying the probability of a target event occurring during the execution of a program processing uncertain incoming data or written itself using probabilistic programming constructs. Recent techniques combine symbolic execution with model counting or solution space quantification methods to obtain accurate estimates of the occurrence probability of rare target events, such as failures in a mission-critical system. However, they face several scalability and applicability limitations when analyzing software processing with high-dimensional and correlated multivariate input distributions. In this paper, we present SYMbolic Parallel Adaptive Importance Sampling (SYMPAIS), a new inference method tailored to analyze path conditions generated from the symbolic execution of programs with high-dimensional, correlated input distributions. SYMPAIS combines results from importance sampling and constraint solving to produce accurate estimates of the satisfaction probability for a broad class of constraints that cannot be analyzed by current solution space quantification methods. We demonstrate SYMPAIS's generality and performance compared with state-of-the-art alternatives on a set of problems from different application domains.

SESSION: Static Analysis and Symbolic Execution

IDE support for cloud-based static analyses

Integrating static analyses into continuous integration (CI) or continuous delivery (CD) has become the best practice for assuring code quality and security. Static Application Security Testing (SAST) tools fit well into CI/CD, because CI/CD allows time for deep static analyses on large code bases and prevents vulnerabilities in the early stages of the development lifecycle. In CI/CD, the SAST tools usually run in the cloud and provide findings via a web interface. Recent studies show that developers prefer seeing the findings of these tools directly in their IDEs. Most tools with IDE integration run lightweight static analyses and can give feedback at coding time, but SAST tools used in CI/CD take longer to run and usually are not able to do so. Can developers interact directly with a cloud-based SAST tool that is typically used in CI/CD through their IDE? We investigated if such a mechanism can integrate cloud-based SAST tools better into a developers’ workflow than web-based solutions. We interviewed developers to understand their expectations from an IDE solution. Guided by these interviews, we implemented an IDE prototype for an existing cloud-based SAST tool. With a usability test using this prototype, we found that the IDE solution promoted more frequent tool interactions. In particular, developers performed code scans three times more often. This indicates better integration of the cloud-based SAST tool into developers’ workflow. Furthermore, while our study did not show statistically significant improvement on developers’ code-fixing performance, it did show a promising reduction in time for fixing vulnerable code.

A bounded symbolic-size model for symbolic execution

Symbolic execution is a powerful program analysis technique which allows executing programs with symbolic inputs. Modern symbolic execution tools use a concrete modeling of object sizes, that does not allow symbolic-size allocations. This leads to concretizations and enforces the user to set the size of the input ahead of time, thus potentially leading to loss of coverage during the analysis. We present a bounded symbolic-size model in which the size of an object can have a range of values limited by a user-specified bound. Unfortunately, this model amplifies the problem of path explosion, due to additional symbolic expressions representing sizes. To cope with this problem, we propose an approach based on state merging that reduces the forking by applying special treatment to symbolic-size dependent loops. In our evaluation on real-world benchmarks, we show that our approach can lead in many cases to substantial gains in terms of performance and coverage, and find previously unknown bugs.

SESSION: Dynamic Analysis

Efficient module-level dynamic analysis for dynamic languages with module recontextualization

Dynamic program analysis is a long-standing technique for obtaining information about program execution. We present module recontextualization, a new dynamic analysis approach that targets modern dynamic languages such as JavaScript and Racket, enabled by the fact that they feature a module-import mechanism that loads code at runtime as a string. This approach uses lightweight load-time code transformations that operate on the string representation of the module, as well as the context to which it is about to be bound, to insert developer-provided, analysis-specific code into the module before it is loaded. This code implements the dynamic analysis, enabling this approach to capture all interactions around the module in unmodified production language runtime environments. We implement this approach in two systems targeting the JavaScript and Racket ecosystems. Our evaluation shows that this approach can deliver order-of-magnitude performance improvements over state-of-the-art dynamic analysis systems while supporting a range of analyses, implemented on average in about 100 lines of code.

SESSION: Industry Papers

Mono2Micro: a practical and effective tool for decomposing monolithic Java applications to microservices

In migrating production workloads to cloud, enterprises often face the daunting task of evolving monolithic applications toward a microservice architecture. At IBM, we developed a tool called Mono2Micro to assist with this challenging task. Mono2Micro performs spatio-temporal decomposition, leveraging well-defined business use cases and runtime call relations to create functionally cohesive partitioning of application classes. Our preliminary evaluation of Mono2Micro showed promising results.

How well does Mono2Micro perform against other decomposition techniques, and how do practitioners perceive the tool? This paper describes the technical foundations of Mono2Micro and presents results to answer these two questions. To answer the first question, we evaluated Mono2Micro against four existing techniques on a set of open-source and proprietary Java applications and using different metrics to assess the quality of decomposition and tool’s efficiency. Our results show that Mono2Micro significantly outperforms state-of-the-art baselines in specific metrics well-defined for the problem domain. To answer the second question, we conducted a survey of twenty-one practitioners in various industry roles who have used Mono2Micro. This study highlights several benefits of the tool, interesting practitioner perceptions, and scope for further improvements. Overall, these results show that Mono2Micro can provide a valuable aid to practitioners in creating functionally cohesive and explainable microservice decompositions.

Data-driven test selection at scale

Large-scale services depend on Continuous Integration/Continuous Deployment (CI/CD) processes to maintain their agility and code-quality. Change-based testing plays an important role in finding bugs, but testing after every change is prohibitively expensive at a scale where thousands of changes are committed every hour. Test selection models deal with this issue by running a subset of tests for every change.

In this paper, we present a generic, language-agnostic and lightweight statistical model for test selection. Unlike existing techniques, the proposed model does not require complex feature extraction techniques. Consequently, it scales to hundreds of repositories of varying characteristics while capturing more than 99% of buggy pull requests. Additionally, to better evaluate test selection models, we propose application-specific metrics that capture both a reduction in resource cost and a reduction in pull-request turn-around time. By evaluating our model on 22 large repositories at Microsoft, we find that we can save 15%−30% of compute time while reporting back more than ≈99% of buggy pull requests.

Effective low capacity status prediction for cloud systems

In cloud systems, an accurate capacity planning is very important for cloud provider to improve service availability. Traditional methods simply predicting "when the available resources is exhausted" are not effective due to customer demand fragmentation and platform allocation constraints. In this paper, we propose a novel prediction approach which proactively predicts the level of resource allocation failures from the perspective of low capacity status. By jointly considering the data from different sources in both time series form and static form, the proposed approach can make accurate LCS predictions in a complex and dynamic cloud environment, and thereby improve the service availability of cloud systems. The proposed approach is evaluated by real-world datasets collected from a large scale public cloud platform, and the results confirm its effectiveness.

Automated code transformation for context propagation in Go

Microservices architecture, which is increasingly being adopted by large technology companies, can accelerate development and deployment of backend code by partitioning a monolithic infrastructure into independent components. At the same time, microservices often compose into massively distributed systems, where analyzing the behavior of an individual service may not be enough to diagnose a performance regression or find a point of failure. Instead, a more global view of the entire computation may be required, where some form of global context is used to trace relevant information flowing through the system.

In the Go language, the recommended method of propagating this tracing context through service’s code is to pass it as the first parameter to all functions on call paths where the context is used. This kind of code transformation, in addition to modifying function calls and function signatures, may involve modifications to other language constructs – performing it manually can be very tedious, particularly for large existing services. In this paper we describe an automated code transformation tool supporting this style of context propagation. We describe the design and implementation of the tool and, based on a case study using real production services, demonstrate that the tool can on average eliminate 94% of manual effort required to propagate tracing context through the code of a given service.

Onion: identifying incident-indicating logs for cloud systems

In cloud systems, incidents affect the availability of services and require quick mitigation actions. Once an incident occurs, operators and developers often examine logs to perform fault diagnosis. However, the large volume of diverse logs and the overwhelming details in log data make the manual diagnosis process time-consuming and error-prone. In this paper, we propose Onion, an automatic solution for precisely and efficiently locating incident-indicating logs, which can provide useful clues for diagnosing the incidents. We first point out three criteria for localizing incident-indicating logs, i.e., Consistency, Impact, and Bilateral-Difference. Then we propose a novel agglomeration of logs, called log clique, based on which these criteria are satisfied. To obtain log cliques, we develop an incident-aware log representation and a progressive log clustering technique. Contrast analysis is then performed on the cliques to identify the incident-indicating logs. We have evaluated Onion using well-labeled log datasets. Onion achieves an average F1-score of 0.95 and can process millions of logs in only a few minutes, demonstrating its effectiveness and efficiency. Onion has also been successfully applied to the cloud system of Microsoft. Its practicability has been confirmed through the quantitative and qualitative analysis of the real incident cases.

Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study

One of the major challenges in the verification of complex industrial Cyber-Physical Systems is the difficulty of determining whether a particular system output or behaviour is correct or not, the so-called test oracle problem. Metamorphic testing alleviates the oracle problem by reasoning on the relations that are expected to hold among multiple executions of the system under test, which are known as Metamorphic Relations (MRs). However, the development of effective MRs is often challenging and requires the involvement of domain experts. In this paper, we present a case study aiming at automating this process. To this end, we implemented GAssertMRs, a tool to automatically generate MRs with genetic programming. We assess the cost-effectiveness of this tool in the context of an industrial case study from the elevation domain. Our experimental results show that in most cases GAssertMRs outperforms the other baselines, including manually generated MRs developed with the help of domain experts. We then describe the lessons learned from our experiments and we outline the future work for the adoption of this technique by industrial practitioners.

Domain adaptation for an automated classification of deontic modalities in software engineering contracts

Contracts are agreements between parties engaging in economic transactions. They specify deontic modalities that the signatories should be held responsible for and state the penalties or actions to be taken if the stated agreements are not met. Additionally, contracts have also been known to be source of Software Engineering (SE) requirements. Identifying the deontic modalities in contracts can therefore add value to the Requirements Engineering (RE) phase of SE. The complex and ambiguous language of contracts make it difficult and time-consuming to identify the deontic modalities (obligations, permissions, prohibitions), embedded in the text. State-of-art neural network models are effective for text classification; however, they require substantial amounts of training data. The availability of contracts data is sparse owing to the confidentiality concerns of customers. In this paper, we leverage the linguistic and taxonomical similarities between regulations (available abundantly in the public domain) and contracts to demonstrate that it is possible to use regulations as training data for classifying deontic modalities in real-life contracts. We discuss the results of a range of experiments from the use of rule-based approach to Bidirectional Encoder Representations from Transformers (BERT) for automating the classification of deontic modalities. With BERT, we obtained an average precision and recall of 90% and 89.66% respectively.

How can manual testing processes be optimized? developer survey, optimization guidelines, and case studies

Manual software testing is tedious and costly as it involves significant human effort. Yet, it is still widely applied in industry and will be in the foreseeable future. Although there is arguably a great need for optimization of manual testing processes, research focuses mostly on optimization techniques for automated tests. Accordingly, there is no precise understanding of the practices and processes of manual testing in industry nor about pitfalls and optimization potential that is untapped. To shed light on this issue, we conducted a survey among 38 testing professionals from 16 companies, to investigate their manual testing processes and to identify potential for optimization. We synthesize guidelines when optimization techniques from automated testing can be implemented for manual testing. By means of case studies on two industrial software projects, we show that fault detection likelihood, test feedback time and test creation efforts can be improved when following our guidelines.

Turnover-induced knowledge loss in practice

When contributors to a software project leave, the knowledge they hold may become lost, thus impacting code quality and team productivity. Although well-known strategies can be used to mitigate knowledge loss, these strategies have to be tailored to their target context to be effective. To help software development organizations mitigate turnover-induced knowledge loss, we sought to better understand the different contexts in which developers experience this knowledge loss, and the resulting implications. We conducted qualitative interviews with 27 professional developers and managers from three different companies that provide software products and services. Leveraging the experience of these practitioners, we contribute a framework for characterizing turnover-induced knowledge loss and descriptions of the implications of knowledge loss, synthesized into 20 observations. These observations about knowledge loss in practice are organized into four themes, validated by the participants, and discussed within the context of the research literature in software engineering.

One thousand and one stories: a large-scale survey of software refactoring

Despite the availability of refactoring as a feature in popular IDEs, recent studies revealed that developers are reluctant to use them, and still prefer the manual refactoring of their code. At JetBrains, our goal is to fully support refactoring features in IntelliJ-based IDEs and improve their adoption in practice. Therefore, we start by raising the following main questions. How exactly do people refactor code? What refactorings are the most popular? Why do some developers tend not to use convenient IDE refactoring tools?

In this paper, we investigate the raised questions through the design and implementation of a survey targeting 1,183 users of IntelliJ-based IDEs. Our quantitative and qualitative analysis of the survey results shows that almost two-thirds of developers spend more than one hour in a single session refactoring their code; that refactoring types vary greatly in popularity; and that a lot of developers would like to know more about IDE refactoring features but lack the means to do so. These results serve us internally to support the next generation of refactoring features, as well as can help our research community to establish new directions in the refactoring usability research.

A comprehensive study on learning-based PE malware family classification methods

Driven by the high profit, Portable Executable (PE) malware has been consistently evolving in terms of both volume and sophistication. PE malware family classification has gained great attention and a large number of approaches have been proposed. With the rapid development of machine learning techniques and the exciting results they achieved on various tasks, machine learning algorithms have also gained popularity in the PE malware family classification task. Three mainstream approaches that use learning based algorithms, as categorized by the input format the methods take, are image-based, binary-based and disassembly-based approaches. Although a large number of approaches are published, there is no consistent comparisons on those approaches, especially from the practical industry adoption perspective. Moreover, there is no comparison in the scenario of concept drift, which is a fact for the malware classification task due to the fast evolving nature of malware. In this work, we conduct a thorough empirical study on learning-based PE malware classification approaches on 4 different datasets and consistent experiment settings. Based on the experiment results and an interview with our industry partners, we find that (1) there is no individual class of methods that significantly outperforms the others; (2) All classes of methods show performance degradation on concept drift (by an average F1-score of 32.23%); and (3) the prediction time and high memory consumption hinder existing approaches from being adopted for industry usage.

Infiltrating security into development: exploring the world’s largest software security study

Recent years have seen rapid increases in cybercrime. The use of effective software security activities plays an important part in preventing the harm involved. Objective research on industry use of software security practices is needed to help development teams, academic researchers, and educators to focus their activities.

Since 2008, a team of researchers, including two of the authors, has been gathering objective data on the use of 121 software security activities. The Building Security In Maturity Model (BSIMM) study explores the activity use of 675,000 software developers, in companies including some of the world’s largest and most security-focused.

Our analysis of the study data shows little consistent growth in security activity adoption industry-wide until 2015. Since then, the data shows a strong increasing trend, along with the adoption of new activities to support cloud-based deployment, an emphasis on component security, and a reduction in security professionals’ policing role. Exploring patterns of adoption, activities related to detecting and responding to vulnerabilities are adopted marginally earlier than activities related to preventing vulnerabilities; and activities related to particular job roles tend to be used together. We also found that 12 developer security activities are adopted early, together, and notably more often than any others.

From these results, we offer recommendations for software and security engineers, and corresponding education and research suggestions for academia. These recommendations offer a strong contribution to improving security in development teams in the future.

Data-driven extract method recommendations: a study at ING

The sound identification of refactoring opportunities is still an open problem in software engineering. Recent studies have shown the effectiveness of machine learning models in recommending methods that should undergo different refactoring operations. In this work, we experiment with such approaches to identify methods that should undergo an Extract Method refactoring, in the context of ING, a large financial organization. More specifically, we (i) compare the code metrics distributions, which are used as features by the models, between open-source and ING systems, (ii) measure the accuracy of different machine learning models in recommending Extract Method refactorings, (iii) compare the recommendations given by the models with the opinions of ING experts. Our results show that the feature distributions of ING systems and open-source systems are somewhat different, that machine learning models can recommend Extract Method refactorings with high accuracy, and that experts tend to agree with most of the recommendations of the model.

Duplicated code pattern mining in visual programming languages

Visual Programming Languages (VPLs), coupled with the high-level abstractions that are commonplace in visual programming environments, enable users with less technical knowledge to become proficient programmers. However, the lower skill floor required by VPLs also entails that programmers are more likely to not adhere to best practices of software development, producing systems with high technical debt, and thus poor maintainability. Duplicated code is one important example of such technical debt. In fact, we observed that the amount of duplication in the OutSystems VPL code bases can reach as high as 39%.

Duplicated code detection in text-based programming languages is still an active area of research with important implications regarding software maintainability and evolution. However, to the best of our knowledge, the literature on duplicated code detection for VPLs is very limited. We propose a novel and scalable duplicated code pattern mining algorithm that leverages the visual structure of VPLs in order to not only detect duplicated code, but also highlight duplicated code patterns that explain the reported duplication. The performance of the proposed approach is evaluated on a wide range of real-world mobile and web applications developed using OutSystems.

Making smart contract development more secure and easier

With the rapid development of distributed applications, smart contracts have attracted more and more developers' attentions. However, developers or domain experts have different levels of familiarity with specific programming languages, like Solidity, and those vulnerabilities hidden in the code would be exploited and result in huge property losses. Existing auxiliary tools lack security considerations. Most of them only provide word completion based on fuzzy search and detection services for limited types of vulnerabilities, which results in the manpower waste during coding and potential vulnerability threats after deployment.

In this work, we propose an integrated framework to enhance security in the two stages of recommendation and validation, assisting developers to implement more secure contracts more quickly. First, we reinforce original smart contracts with general patch patterns and secure programming standards for training, and design a real-time code suggestion algorithm to predict secure words for selection. Then, we integrate multiple widely-used testing tools to provide validation services. For evaluation, we collected 47,398 real-world contracts, and the result shows that it outperforms existing platforms and tools, improving the average word suggestion accuracy by 30%-60% and helping detect about 25%-61% more vulnerabilities. In most cases, our framework can correctly predict next words with the probability up to 82%-97% within top ten candidates. Compared with professional vulnerability mining tools, it can find more vulnerabilities and provide targeted modification suggestions without frivolous configurations. Currently, this framework has been used as the official development tool of WeBank and integrated as the recommended platform by FISCO-BCOS community.

Quantifying no-fault-found test failures to prioritize inspection of flaky tests at Ericsson

A test fails and despite an investigation by a developer there is no fault found (NFF). Large software systems are often released with known failing and flaky tests. In this work, we quantify how often a test fails and does not find a fault. We conduct a case study on 9.9 million test runs of 10k tests across four releases of a large project at Ericsson.

For each test, we mine the rate of NFF test failures over total runs for each release, i.e. NFFRate. We compare the current level of test failure with the number of NFF failures during the stabilization period of the prior release, i.e. StableNFFRate. Using the binomial distribution, we are able to determine which tests exhibit a statistically larger number of failures relative to their expected StableNFFRate. These unstable tests need to be prioritized for re-run and potentially investigated to determine if there is a fault or if the test needs to be fixed or modified.

Our work has had an impact on Ericsson’s testing practices with testers using the NFFRate to determine which tests are the “flakiest" and need to be fixed or moved into an earlier, virtualized unit test stage. Testers also use our tool and technique to prioritize the statistically unstable tests failures for investigation and to examine longterm trends of test failures that may indicate a fault.

When life gives you oranges: detecting and diagnosing intermittent job failures at Mozilla

Continuous delivery of cloud systems requires constant running of jobs (build processes, tests, etc.). One issue that plagues this continuous integration (CI) process are intermittent failures - non-deterministic, false alarms that do not result from a bug in the software or job specification, but rather from issues in the underlying infrastructure. At Mozilla, such intermittent failures are called oranges as a reference to the color of the build status indicator. As such intermittent failures disrupt CI and lead to failures, they erode the developers' trust in the jobs. We present a novel approach that automatically classifies failing jobs to determine whether job execution failures arise from an actual software bug or were caused by flakiness in the job (e.g., test) or the underlying infrastructure. For this purpose, we train classification models using job telemetry data to diagnose failure patterns involving features such as runtime, cpu load, operating system version, or specific platform with high precision. In an evaluation on a set of Mozilla CI jobs, our approach achieves precision scores of 73%, on average, across all data sets with some test suites achieving precision scores good enough for fully automated classification (i.e., precision scores of up to 100%), and recall scores of 82% on average (up to 94%).

FuzzBench: an open fuzzer benchmarking platform and service

Fuzzing is a key tool used to reduce bugs in production software. At Google, fuzzing has uncovered tens of thousands of bugs. Fuzzing is also a popular subject of academic research. In 2020 alone, over 120 papers were published on the topic of improving, developing, and evaluating fuzzers and fuzzing techniques. Yet, proper evaluation of fuzzing techniques remains elusive. The community has struggled to converge on methodology and standard tools for fuzzer evaluation.

To address this problem, we introduce FuzzBench as an open-source turnkey platform and free service for evaluating fuzzers. It aims to be easy to use, fast, reliable, and provides reproducible experiments. Since its release in March 2020, FuzzBench has been widely used both in industry and academia, carrying out more than 150 experiments for external users. It has been used by several published and in-the-work papers from academic groups, and has had real impact on the most widely used fuzzing tools in industry. The presented case studies suggest that FuzzBench is on its way to becoming a standard fuzzer benchmarking platform.

An empirical investigation of practical log anomaly detection for online service systems

Log data is an essential and valuable resource of online service systems, which records detailed information of system running status and user behavior. Log anomaly detection is vital for service reliability engineering, which has been extensively studied. However, we find that existing approaches suffer from several limitations when deploying them into practice, including 1) inability to deal with various logs and complex log abnormal patterns; 2) poor interpretability; 3) lack of domain knowledge. To help understand these practical challenges and investigate the practical performance of existing work quantitatively, we conduct the first empirical study and an experimental study based on large-scale real-world data. We find that logs with rich information indeed exhibit diverse abnormal patterns (e.g., keywords, template count, template sequence, variable value, and variable distribution). However, existing approaches fail to tackle such complex abnormal patterns, producing unsatisfactory performance. Motivated by obtained findings, we propose a generic log anomaly detection system named LogAD based on ensemble learning, which integrates multiple anomaly detection approaches and domain knowledge, so as to handle complex situations in practice. About the effectiveness of LogAD, the average F1-score achieves 0.83, outperforming all baselines. Besides, we also share some success cases and lessons learned during our study. To our best knowledge, we are the first to investigate practical log anomaly detection in the real world deeply. Our work is helpful for practitioners and researchers to apply log anomaly detection to practice to enhance service reliability.

RAPID: checking API usage for the cloud in the cloud

We present RAPID, an industrial-strength analysis developed at AWS that aims to help developers by providing automatic, fast and actionable feedback about correct usage of cloud-service APIs. RAPID’s design is based on the insight that cloud service APIs are structured around short-lived request- and response-objects whose usage patterns can be specified as value-dependent type-state automata and be verified by combining local type-state with global value-flow analyses. We describe various challenges that arose to deploy RAPID at scale. Finally, we present an evaluation that validates our design choices, deployment heuristics, and shows that RAPID is able to quickly and precisely report a wide variety of useful API misuse violations in large, industrial-strength code bases.

An empirical study of GUI widget detection for industrial mobile games

With the widespread adoption of smartphones in our daily life, mobile games experienced increasing demand over the past years. Meanwhile, the quality of mobile games has been continuously drawing more and more attention, which can greatly affect the player experience. For better quality assurance, general-purpose testing has been extensively studied for mobile apps. However, due to the unique characteristic of mobile games, existing mobile testing techniques may not be directly suitable and applicable. To better understand the challenges in mobile game testing, in this paper, we first initiate an early step to conduct an empirical study towards understanding the challenges and pain points of mobile game testing process at our industrial partner NetEase Games. Specifically, we first conduct a survey from the mobile test development team at NetEase Games via both scrum interviews and questionnaires. We found that accurate and effective GUI widget detection for mobile games could be the pillar to boost the automation of mobile game testing and other downstream analysis tasks in practice.

We then continue to perform comparative studies to investigate the effectiveness of state-of-the-art general-purpose mobile app GUI widget detection methods in the context of mobile games. To this end, we also develop a technique to automatically collect GUI widgets region information of industrial mobile games, which is equipped with a heuristic-based data cleaning method for quality refinement of the labeling results. Our evaluation shows that: (1) Existing GUI widget detection methods for general-purpose mobile apps cannot perform well on industrial mobile games. (2) Mobile game exhibits obvious difference from other general-purpose mobile apps in the perspective GUI widgets. Our further in-depth analysis reveals high diversity and density characteristics of mobile game GUI widgets could be the major reasons that post the challenges for existing methods, which calls for new research methods and better industry practices. To enable further research along this line, we construct the very first GUI widget detection benchmark, specially designed for mobile games, incorporating both our collected dataset and the state-of-the-art widget detection methods for mobile apps, which could also be the basis for further study of many downstream quality assurance tasks (e.g., testing and analysis) for mobile games.

Intelligent container reallocation at Microsoft 365

The use of containers in microservices has gained popularity as it facilitates agile development, resource governance, and software maintenance. Container reallocation aims to achieve workload balance via reallocating containers over physical machines. It affects the overall performance of microservice-based systems. However, container scheduling and reallocation remain an open issue due to their complexity in real-world scenarios. In this paper, we propose a novel Multi-Phase Local Search (MPLS) algorithm to optimize container reallocation. The experimental results show that our optimization algorithm outperforms state-of-the-art methods. In practice, it has been successfully applied to Microsoft 365 system to mitigate hotspot machines and balance workloads across the entire system.

Organizational implications of agile adoption: a case study from the public sector

While agile software development is increasingly adopted in large organizations, there is still a lack of studies on how traditionally organized enterprises adopt and scale agile forms of organization. This industrial multiple embedded case study explores how the organizational model of a large public sector entity evolved over four years to support the adoption of agile software development methods. Data was collected through semi-structured interviews and document analysis. We describe the change in three phases: pre-transformation, initial transformation, and maturing. Changes in three subcases of organizational units are further described in detail. Moving from an outsourced project-based way-of-working with separate business, IT and vendor organizations, the new organizational design emphasizes internal development capability, cross-functional autonomous teams organized around products and grouped in product areas, and continuous delivery. Starting from the IT department, the transformation expanded to the whole organization, and went beyond software development to the finance and leadership. We describe the target and intermediate organizations employed when adopting agile development methods for the whole organization and three organizational units responsible for different services. Defining suitable product boundaries, achieving alignment across teams, enhancing the competence of product owners, the coexistence of old and new types of systems, processes, and structures, and balancing the teams’ need for autonomy with the organizational needs for coordination and control are remaining challenges.

SESSION: Ideas, Visions, and Reflections

Towards immersive software archaeology: regaining legacy systems’ design knowledge via interactive exploration in virtual reality

Many of today's software systems will become the legacy systems of tomorrow, comprised of outdated technology and inaccurate design documents. Preparing for their eventual re-engineering requires engineers to regain lost design knowledge and discover re-engineering opportunities. While tools and visualizations exist, comprehending an unfamiliar code base remains challenging. Hence, software archaeology suffers from a considerable entry barrier as it requires expert knowledge, significant diligence, tenacity, and stamina. In this paper, we propose a paradigm shift in how legacy systems' design knowledge can be regained by presenting our vision for an immersive explorable software visualization in virtual reality (VR). We propose innovative concepts leveraging benefits of VR for a) immersion in an exoteric visualization metaphor, b) effective navigation and orientation, c) guiding exploration, and d) maintaining a link to the implementation. By enabling immersive and playful legacy system exploration, we strive for lowering the entry barrier, fostering long-term engagement, strengthening mental-model building, and improving knowledge retention in an effort to ease coping with the increased number of tomorrow's legacy systems.

Reducing the search space of bug inducing commits using failure coverage

Knowing how exactly a bug has been introduced into the code can help developers debug the bug efficiently. However, techniques currently used to retrieve Bug Inducing Commits (BICs) from the repository timeline are limited in their accuracy. Automated bisection of the version history depends on the bug revealing test case being executable against all candidate previous versions, whereas blaming the last commits that touched the same parts as the fixing commit (à la SZZ) requires that the bug has already been fixed. We show that filtering commits using the coverage of the bug revealing test cases can effectively reduce the search space for both bisection and SZZ-like blame models by 87.6% and 27.9%, respectively, significantly reducing the cost of BIC retrieval. The application of our approach to bugs in Defects4J also reveals inconsistencies in some of their BICs known in the literature.

The gas triangle and its challenges to the development of blockchain-powered applications

Ethereum is the most popular blockchain platform for the development of blockchain-powered applications (a.k.a, ). Developing a involves translating requests captured in the frontend of an application into contract transactions. However, transactions need to be payed for. Ethereum employs the gas system to charge transaction fees. The gas system has three key components, namely gas price, gas usage, and gas limit. We refer to these components and their interplay as the gas triangle. In this paper, we claim that the inherently complex gas triangle should not be exposed to end-users. We conduct two studies that provide empirical evidence to support our claim. In light of our results, we provide a list of recommendations to novice end-users. We conclude the paper with a list of research challenges that need to be tackled in order to support the development of next-generation that completely hide the gas triangle from end-users.

Selecting test inputs for DNNs using differential testing with subspecialized model instances

Testing of Deep Learning (DL) models is difficult due to the lack of automated test oracle and the high cost of human labelling. Differential testing has been used as a surrogate oracle, but there is no systematic guide on how to choose the reference model to use for differential testing. We propose a novel differential testing approach based on subspecialized models, i.e., models that are trained on sliced training data only (hence specialized for the slice). A preliminary evaluation of our approach with an CNN-based EMNIST image classifier shows that it can achieve higher error detection rate with selected inputs compared to using more advanced ResNet and LeNet as the reference model for differential testing. Our approach also outperforms N-version testing, i.e., the use of the same DL model architecture trained separately but using the same data.

Term interrelations and trends in software engineering

The Software Engineering (SE) community is prolific, making it challenging for experts to keep up with the flood of new papers and for neophytes to enter the field. Therefore, we posit that the community may benefit from a tool extracting terms and their interrelations from the SE community's text corpus and showing terms' trends. In this paper, we build a prototyping tool using the word embedding technique. We train the embeddings on the SE Body of Knowledge handbook and 15,233 research papers' titles and abstracts. We also create test cases necessary for validation of the training of the embeddings. We provide representative examples showing that the embeddings may aid in summarizing terms and uncovering trends in the knowledge base.

Software robustness: a survey, a theory, and prospects

If a software execution is disrupted, witnessing the execution at a later point may see evidence of the disruption or not. If not, we say the disruption failed to propagate. One name for this phenomenon is software robustness but it appears in different contexts in software engineering with different names. Contexts include testing, security, reliability, and automated code improvement or repair. Names include coincidental correctness, correctness attraction, transient error reliability. As witnessed, it is a dynamic phenomenon but any explanation with predictive power must necessarily take a static view. As a dynamic/static phenomenon it is convenient to take a statistical view of it which we do by way of information theory. We theorise that for failed disruption propagation to occur, a necessary condition is that the code region where the disruption occurs is composed with or succeeded by a subsequent code region that suffers entropy loss over all executions. The higher is the entropy loss, the higher the likelihood that disruption in the first region fails to propagate to the downstream observation point. We survey different research silos that address this phenomenon and explain how the theory might be exploited in software engineering.

Towards automating code review at scale

As neural methods are increasingly used to support and automate software development tasks, code review is a natural next target. Yet, training models to imitate developers based on past code reviews is far from straightforward: reviews found in open-source projects vary greatly in quality, phrasing, and depth depending on the reviewer. In addition, changesets are often large, stretching the capacity of current neural models. Recent work reported modest success at predicting review resolutions, but largely side-stepped the above issues by focusing on small inputs where comments were already known to occur. This work examines the vision and challenges of automating code review at realistic scale. We collect hundreds of thousands of changesets across hundreds of projects that routinely conduct code review, many of which change thousands of tokens. We focus on predicting just the locations of comments, which are quite rare. By analyzing model performance and dataset statistics, we show that even this task is already challenging, in no small part because of tremendous variation and (apparent) randomness in code reviews. Our findings give rise to a research agenda for realistically and impactfully automating code review.

Learning type annotation: is big data enough?

TypeScript is a widely used optionally-typed language where developers can adopt “pay as you go” typing: they can add types as desired, and benefit from static typing. The “type annotation tax” or manual effort required to annotate new or existing TypeScript can be reduced by a variety of automatic methods. Probabilistic machine-learning (ML) approaches work quite well. ML approaches use different inductive biases, ranging from simple token sequences to complex graphical neural network (GNN) models capturing syntax and semantic relations. More sophisticated inductive biases are hand-engineered to exploit the formal nature of software. Rather than deploying fancy inductive biases for code, can we just use “big data” to learn natural patterns relevant to typing? We find evidence suggesting that this is the case. We present TypeBert, demonstrating that even with simple token-sequence inductive bias used in BERT-style models and enough data, type-annotation performance of the most sophisticated models can be surpassed.

New visions on metamorphic testing after a quarter of a century of inception

Metamorphic testing (MT) was introduced about a quarter of a century ago. It is increasingly being accepted by researchers and the industry as a useful testing technique. The studies, research results, applications, and extensions of MT have given us many insights and visions for its future. Our visions include: MRs will be a practical means to top up test case generation techniques, beyond the alleviation of the test oracle problem; MT will not only be a standalone technique, but conveniently integrated with other methods; MT and MRs will evolve beyond software testing, or even beyond verification; MRs may be anything that you can imagine, beyond the necessary properties of algorithms; MT research will be beyond empirical studies and move toward a theoretical foundation; MT will not only bring new concepts to software testing but also new concepts to other disciplines; MRs will alleviate the reliable test set problem beyond traditional approaches. These visions may help researchers explore the challenges and opportunities for MT in the next decade.

Health of smart ecosystems

Software is a core component of smart ecosystems, large ’system communities’ that emerge from the composition of autonomous, independent, and highly heterogeneous systems, like smart cities, smart grids, smart buildings. The systems that comprise smart ecosystems are not centrally owned, and mutually interact both explicitly and implicitly, leading to unavoidable contradictions and failures. The distinctive characteristics of smart ecosystems challenge software engineers with problems never addressed so far. In this paper we discuss the big challenge of defining a new concept of ’dependability’ and new approaches to reveal smart ecosystem failures.

SESSION: Demonstrations

LLSC: a parallel symbolic execution compiler for LLVM IR

We present LLSC, a prototype compiler for nondeterministic parallel symbolic execution of the LLVM intermediate representation (IR). Given an LLVM IR program, LLSC generates code preserving the symbolic execution semantics and orchestrating solver invocations. The generated code runs efficiently, since the code has eliminated the interpretation overhead and explores multiple paths in parallel. To the best of our knowledge, LLSC is the first compiler for fork-based symbolic execution semantics that can generate parallel execution code.

In this demonstration paper, we present the current development and preliminary evaluation of LLSC. The principle behind LLSC is to automatically specialize a symbolic interpreter via the 1st Futamura projection, a fundamental connection between interpreters and compilers. The symbolic interpreter is written in an expressive high-level language equipped with a multi-stage programming facility. We demonstrate the run time performance through a set of benchmark programs, showing that LLSC outperforms interpretation-based symbolic execution engines in significant ways.

OwlEyes-online: a fully automated platform for detecting and localizing UI display issues

Graphical User Interface (GUI) provides visual bridges between software apps and end users. However, due to the compatibility of software or hardware, UI display issues such as text overlap, blurred screen, image missing always occur during GUI rendering on different devices. Because these UI display issues can be found directly by human eyes, in this paper, we implement an online UI display issue detection tool OwlEyes-Online, which provides a simple and easy-to-use platform for users to realize the automatic detection and localization of UI display issues. The OwlEyes-Online can automatically run the app and get its screenshots and XML files, and then detect the existence of issues by analyzing the screenshots. In addition, OwlEyes-Online can also find the detailed area of the issue in the given screenshots to further remind developers. Finally, OwlEyes-Online will automatically generate test reports with UI display issues detected in app screenshots and send them to users. The OwlEyes-Online was evaluated and proved to be able to accurately detect UI display issues. Tool Link: http://www.owleyes.online:7476 Github Link: https://github.com/franklinbill/owleyes Demo Video Link: https://youtu.be/002nHZBxtCY

Exploit those code reviews! bigger data for deeper learning

Modern code review (MCR) processes are prevalent in most organizations that develop software due to benefits in quality assurance and knowledge transfer. With the rise of collaborative software development platforms like GitHub and Bitbucket, today, millions of projects share not only their code but also their review data. Although researchers have tried to exploit this data for more than a decade, most of that knowledge remains a buried treasure. A crucial catalyst for many advances in deep learning, however, is the accessibility of large-scale standard datasets for different learning tasks. This paper presents the ETCR (Exploit Those Code Reviews!) infrastructure for mining MCR datasets from any GitHub project practicing pull-request-based development. We demonstrate its effectiveness with ETCR-Elasticsearch, a dataset of >231𝑘 review comments for >47𝑘 Java file revisions in >40𝑘 pull-requests from the Elasticsearch project. ETCR is designed with the challenge of deep learning in mind. Compared to previous datasets, ETCR datasets include all information for linking review comments to nodes in the respective program’s Abstract Syntax Tree.

BRAID: an API recommender supporting implicit user feedback

Efficient application programming interface (API) recommendation is one of the most desired features of modern integrated development environments. A multitude of API recommendation approaches have been proposed. However, most of the currently available API recommenders do not support the effective integration of user feedback into the recommendation loop. In this paper, we present BRAID (Boosting RecommendAtion with Implicit FeeDback), a tool which leverages user feedback, and employs learning-to-rank and active learning techniques to boost recommendation performance. The implementation is based on the VSCode plugin architecture, which provides an integrated user interface. Essentially, BRAID is a general framework which can accommodate existing query-based API recommendation approaches as components. Comparative experiments with strong baselines demonstrate the efficacy of the tool. A video demonstrating the usage of BRAID can be found at https://youtu.be/naD0guvl8sE.

KGAMD: an API-misuse detector driven by fine-grained API-constraint knowledge graph

Application Programming Interfaces (APIs) typically come with usage constraints. The violations of these constraints (i.e. API misuses) can cause significant problems in software development. Existing methods mine frequent API usage patterns from codebase to detect API misuses. They make a naive assumption that API usage that deviates from the most-frequent API usage is a misuse. However, there is a big knowledge gap between API usage patterns and API usage constraints in terms of comprehensiveness, explainability and best practices. Inspired by this, we propose a novel approach named KGAMD (API-Misuse Detector Driven by Fine-Grained API-Constraint Knowledge Graph) that detects API misuses directly against the API constraint knowledge, rather than API usage pat-terns. We first construct a novel API-constraint knowledge graph from API reference documentation with open information extraction methods. This knowledge graph explicitly models two types of API-constraint relations (call-order and condition-checking) and enriches return and throw relations with return conditions and exception triggers. Then, we develop the KGAMD tool that utilizes the knowledge graph to detect API misuses. There are three types of frequent API misuses we can detect - missing calls, missing condition checking and missing exception handling, while existing detectors mostly focus on only missing calls. Our quantitative evaluation and user study demonstrate that our KGAMD is promising in helping developers avoid and debug API misuses

Demo Video: https://www.youtube.com/watch?v=TN4LtHJ-494 IntelliJ plug-in: https://github.com/goodchar/KGAMD

Sangrahaka: a tool for annotating and querying knowledge graphs

We present a web-based tool Sangrahaka for annotating entities and relationships from text corpora towards construction of a knowledge graph and subsequent querying using templatized natural language questions. The application is language and corpus agnostic, but can be tuned for specific needs of a language or a corpus. The application is freely available for download and installation. Besides having a user-friendly interface, it is fast, supports customization, and is fault tolerant on both client and server side. It outperforms other annotation tools in an objective evaluation metric. The framework has been successfully used in two annotation tasks. The code is available from <a>https://github.com/hrishikeshrt/sangrahaka</a>.

Code2Que: a tool for improving question titles from mined code snippets in stack overflow

Stack Overflow is one of the most popular technical Q&A sites used by software developers. Seeking help from Stack Overflow has become an essential part of software developers’ daily work for solving programming-related questions. Although the Stack Overflow community has provided quality assurance guidelines to help users write better questions, we observed that a significant number of questions submitted to Stack Overflow are of low quality. In this paper, we introduce a new web-based tool, Code2Que, which can help developers in writing higher quality questions for a given code snippet. Code2Que consists of two main stages: offline learning and online recommendation. In the offline learning phase, we first collect a set of good quality ⟨code snippet, question⟩ pairs as training samples. We then train our model on these training samples via a deep sequence-to-sequence approach, enhanced with an attention mechanism, a copy mechanism and a coverage mechanism. In the online recommendation phase, for a given code snippet, we use the offline trained model to generate question titles to assist less experienced developers in writing questions more effectively. To evaluate Code2Que, we first sampled 50 low quality ⟨code snippet, question⟩ pairs from the Python and Java datasets on Stack Overflow. Then we conducted a user study to evaluate the question titles generated by our approach as compared to human-written ones using three metrics: Clearness, Fitness and Willingness to Respond. Our experimental results show that for a large number of low-quality questions in Stack Overflow, Code2Que can improve the question titles in terms of Clearness, Fitness and Willingness measures.

BF-detector: an automated tool for CI build failure detection

Continuous Integration (CI) aims at supporting developers in inte-grating code changes quickly through automated building. How-ever, there is a consensus that CI build failure is a major barrierthat developers face, which prevents them from proceeding furtherwith development. In this paper, we introduceBF-Detector, anautomated tool to detect CI build failure. Based on the adaptationof Non-dominated Sorting Genetic Algorithm (NSGA-II), our toolaims at finding the best prediction rules based on two conflictingobjective functions to deal with both minority and majority classes.We evaluated the effectiveness of our tool on a benchmark of 56,019CI builds. The results reveal that our technique outperforms state-of-the-art approaches by providing a better balance between bothfailed and passed builds.BF-Detectortool is publicly available,with a demo video, at: https://github.com/stilab-ets/BF-Detector.

AlloyFL: a fault localization framework for Alloy

Declarative models help improve the reliability of software systems: models can be used to convey requirements, analyze system designs and verify implementation properties. Alloy is a commonly used modeling language. A key strength of Alloy is the Analyzer, Alloy's integrated development environment (IDE), which allows users to write and execute models by leveraging a fully automatic SAT based analysis engine. Unfortunately, writing correct constraints of complex properties is difficult. To help users identify fault locations, AlloyFL is a fault localization technique that takes as input a faulty Alloy model and a fault-revealing test suite. As output, AlloyFL returns a ranked list of locations from most to least suspicious. This paper describes our Java implementation of AlloyFL as an extension to the Analyzer. Our experimental results show AlloyFL is capable of detecting the location of real world faults and works in the presence of multiple faulty locations. The demo video for AlloyFL can be found at https://youtu.be/ZwgP58Nsbx8.

BiasRV: uncovering biased sentiment predictions at runtime

Sentiment analysis (SA) systems, though widely applied in many domains, have been demonstrated to produce biased results. Some research works have been done in automatically generating test cases to reveal unfairness in SA systems, but the community still lacks tools that can monitor and uncover biased predictions at runtime. This paper fills this gap by proposing BiasRV, the first tool to raise an alarm when a deployed SA system makes a biased prediction on a given input text. To implement this feature, BiasRV dynamically extracts a template from an input text and generates gender-discriminatory mutants (semantically-equivalent texts that only differ in gender information) from the template. Based on popular metrics used to evaluate the overall fairness of an SA system, we define the distributional fairness property for an individual prediction of an SA system. This property specifies a requirement that for one piece of text, mutants from different gender classes should be treated similarly. Verifying the distributional fairness property causes much overhead to the running system. To run more efficiently, BiasRV adopts a two-step heuristic: (1) sampling several mutants from each gender and checking if the system predicts them as of the same sentiment, (2) checking distributional fairness only when sampled mutants have conflicting results. Experiments show that when compared to directly checking the distributional fairness property for each input text, our two-step heuristic can decrease the overhead used for analyzing mutants by 73.81% while only resulting in 6.7% of biased predictions being missed. Besides, BiasRV can be used conveniently without knowing the implementation of SA systems. Future researchers can easily extend BiasRV to detect more types of bias, e.g., race and occupation. The demo video for BiasRV can be viewed at https://youtu.be/WPe4Ml77d3U and the source code can be found at https://github.com/soarsmu/BiasRV.

ICME: an informed consent management engine for conformance in smart building environments

Smart buildings can reveal highly sensitive insights about their inhabitants and expose them to new privacy threats and vulnerabilities. Yet, convenience overrides privacy concerns and most people remain ignorant about this issue. We propose a novel Informed Consent Management Engine (ICME) that aims to: (a) increase users’ awareness about privacy issues and data collection practices in their smart building environments, (b) provide fine-grained visibility into privacy conformance and infringement by these devices, (c) recommend and visualise corrective user actions through ”digital nudging”, and (d) support the monitoring and management of personal data disclosure in a shared space. We present a reference architecture for ICME that can be used by software engineers to implement diverse end-user consent management solutions for smart buildings. We also provide a proof-of-concept prototype to demonstrate how the ICME approach works in a shared smart workplace. Demo: <a>https://youtu.be/5y6CdyWAdgY</a>

StackEmo: towards enhancing user experience by augmenting stack overflow with emojis

Many novice programmers visit Stack Overflow for purposes that include posing questions and finding answers for issues they come across in the process of programming. Many questions have more than one correct answer on Stack Overflow, which are accompanied by number of comments from the users. Comments help developers in identifying the answer that better fits their purpose. However, it is difficult to navigate through all the comments to select an answer. Adding relevant visual cues to comments could help developers in prioritizing the comments to be read. Comments logged generally include sentiments of users, which, when depicted visually, could motivate users in reading through the comments and also help them in prioritizing the comments. However, the sentiment of comments is not being explicitly depicted on the current Stack Overflow platform. While there exist many tools that augment or annotate Stack Overflow platform for developers, we are not aware of tools that annotate visual representations of sentiments to the posts. In this paper, we propose StackEmo as a Google Chrome plugin to augment comments on Stack Overflow with emojis, based on the sentiment of the comments posted. We evaluated StackEmo through an in-user likert scale based survey with 30 university students to understand user perception towards StackEmo. The results of the survey provided us insights on improving StackEmo, with 83% of the participants willing to recommend the plugin to their peers. The source code and tool are available for download on GitHub at: https://github.com/rishalab/StackEmo, and the demo can be found here on youtube: https://youtu.be/BCFlqvMhTMA.

AC²: towards understanding architectural changes in Python projects

Open source projects are adopting faster release cycles that reflect various changes in the software. Therefore, comprehending the effects of these changes as software architecture evolves over multiple releases becomes necessary. However, it is challenging to keep architecture in-check and add new changes simultaneously for every release. To this end, we propose a visualization tool called AC2, which allows users to examine the alterations in the architecture at both higher and lower levels of abstraction for Python projects. AC2 uses call graphs and collaboration graphs to show the interaction between different architectural components. The tool provides four different views to see the architectural changes. Users can examine two releases at a time to comprehend architectural changes between them. AC2 can support the maintainers and developers, observing changes in the project and their influence on the architecture, which allows them to examine its increasing complexity over many releases at component level. AC2 can be downloaded from <a>https://github.com/rishalab/AC2</a> and the demo can be seen at <a>https://www.youtube.com/watch?v=GNrJfZ0RCVI</a>.

csDetector: an open source tool for community smells detection

Community smells represent symptoms of sub-optimal organizational and social issues within software development communities that often lead to additional project costs and reduced software quality. Previous research identified a variety of community smells that are connected to sub-optimal patterns under different perspectives of organizational-social structures in the software development community. To detect community smells and understanding the characteristics of such organizational-social structures in a project, we propose csDetector, an open source tool that is able to automatically detect community smells within a project and provide relevant socio-technical metrics. csDetector uses a machine learning based detection approach that learns from various existing bad community development practices to provide automated support in detecting related community smells. We evaluate the effectiveness of csDetector on a benchmark of 143 open source projects from GitHub. Our results show that the csDetector tool can detect ten commonly occurring community smells in open software projects with an average F1 score of 84%. csDetector is publicly available, with a demo video, at: https://github.com/Nuri22/csDetector.

CrossVul: a cross-language vulnerability dataset with commit data

Examining the characteristics of software vulnerabilities and the code that contains them can lead to the development of more secure software. We present a dataset (∼1.4 GB) containing vulnerable source code files together with the corresponding, patched versions. Contrary to other existing vulnerability datasets, ours includes vulnerable files written in more than 40 programming languages. Each file is associated to (1) a Common Vulnerability Exposures identifier (CVE ID) and (2) the repository it came from. Further, our dataset can be the basis for machine learning applications that identify defects, as we show in specific examples. We also present a supporting dataset that contains commit messages derived from Git commits that serve as security patches. This dataset can be used to train ML models that in turn, can be used to detect security patch commits as we highlight in a specific use case.

Slicer4J: a dynamic slicer for Java

Dynamic program slicing is used in a variety of tasks, including program debugging and security analysis. Despite being extensively studied in the literature, the only dynamic slicing solution for Java programs that is publicly available today is a tool named JavaSlicer. Unfortunately, JavaSlicer only supports programs written in Java 6 or below and does not support multithreading. To address these limitations, this paper contributes a new dynamic slicing tool for Java, named Slicer4J. Slicer4J uses low-overhead instrumentation to collect a runtime execution trace; it then constructs a thread-aware, inter-procedural dynamic control-flow graph and uses the graph to compute the slice. To support slicing through Java framework methods and native code, Slicer4J relies on a set of pre-constructed data-flow summaries of the main framework methods. It also allows the users to further customize this set, adding user-defined methods when needed. We demonstrate the applicability of Slicer4J on ten benchmark and open-source Java programs, comparing it with JavaSlicer, and discuss how to use and extend the tool.

CrossASR++: a modular differential testing framework for automatic speech recognition

Developers need to perform adequate testing to ensure the quality of Automatic Speech Recognition (ASR) systems. However, manually collecting required test cases is tedious and time-consuming. Our recent work proposes CrossASR, a differential testing method for ASR systems. This method first utilizes Text-to-Speech (TTS) to generate audios from texts automatically and then feed these audios into different ASR systems for cross-referencing to uncover failed test cases. It also leverages a failure estimator to find failing test cases more efficiently. Such a method is inherently self-improvable: the performance can increase by leveraging more advanced TTS and ASR systems. So, in this accompanying tool demo paper, we further engineer CrossASR and propose CrossASR++, an easy-to-use ASR testing tool that can be conveniently extended to incorporate different TTS and ASR systems, and failure estimators. We also make CrossASR++ chunk texts from a given corpus dynamically and enable the estimator to work in a more effective and flexible way. We demonstrate that the new features can help CrossASR++ discover more failed test cases. Using the same TTS and ASR systems, CrossASR++ can uncover 26.2% more failed test cases for 4 ASRs than the original tool. Moreover, by simply adding one more ASR for cross-referencing, we can increase the number of failed test cases uncovered for each of the 4 ASR systems by 25.07%, 39.63%, 20.95% and 8.17% respectively. We also extend CrossASR++ with 5 additional failure estimators. Compared to worst estimator, the best one can discover 10.41% more failed test cases within the same amount of time. The demo video for CrossASR++ can be viewed at https://youtu.be/ddRk-f0QV-g and the source code can be found at https://github.com/soarsmu/CrossASRplus.

Frontmatter: mining Android user interfaces at scale

We introduce Frontmatter: the largest open-access dataset containing user interface models of about 160,000 Android apps. Frontmatter opens the door for comprehensive mining of mobile user interfaces, jumpstarting empirical research at a large scale, addressing questions such as "How many travel apps require registration?", "Which apps do not follow accessibility guidelines?", "Does the user interface correspond to the description?", and many more. The Frontmatter UI analysis tool and the Frontmatter dataset are available under an open-source license.

GenSys: a scalable fixed-point engine for maximal controller synthesis over infinite state spaces

The synthesis of maximally-permissive controllers in infinite-state systems has many practical applications. Such controllers directly correspond to maximal winning strategies in logically specified infinite-state two-player games. In this paper, we introduce a tool called GenSys which is a fixed-point engine for computing maximal winning strategies for players in infinite-state safety games. A key feature of GenSys is that it leverages the capabilities of existing off-the-shelf solvers to implement its fixed point engine. GenSys outperforms state-of-the-art tools in this space by a significant margin. Our tool has solved some of the challenging problems in this space, is scalable, and also synthesizes compact controllers. These controllers are comparatively small in size and easier to comprehend. GenSys is freely available for use and is available under an open-source license.

Analysis of specifications of multiparty sessions with dcj-lint

Multiparty session types constitute a method to automatically detect violations of protocol implementations relative to specifications. But, when a violation is detected, does it symptomise a bug in the implementation or in the specification? This paper presents dcj-lint: an analysis tool to detect bugs in protocol specifications, based on multiparty session types. By leveraging a custom-built temporal logic model checker, dcj-lint can be used to efficiently perform: (1) generic sanity checks, and (2) protocol-specific property analyses. In our benchmarks, dcj-lint outperforms an existing state-of-the-art model checker (up to 61x faster).

SESSION: Reuse, Reproduction, and Replication

Documenting evidence of a reuse of ‘a systematic study of the class imbalance problem in convolutional neural networks’

We report here the reuse of oversampling, and modifications to the basic approach, used in a recent TSE ’21 paper by YedidaMenzies. The method reused is the oversampling technique studied by Buda et al. These methods were studied in the SE domain (specifically, for defect prediction), and extended by Yedida & Menzies.

Documenting evidence of a reuse of ‘on the number of linear regions of deep neural networks’

We report here the reuse of theoretical insights from deep learning literature, used in a recent TSE '21 paper by Yedida & Menzies. The artifact replicated is the lower bound on the number of piecewise linear regions in the decision boundary of a feedforward neural network with ReLU activations, as studied by Montufar et al. We document the reuse of Theorem 4 from Montufar et al. by Yedida & Menzies.

Documenting evidence of a reuse of ‘a systematic literature review of techniques and metrics to reduce the cost of mutation testing’

This submission is a report on the reuse of Pizzoleto et al.'s Systematic Literature Review by Guizzo et al.

Documenting evidence of a reuse of ‘RefactoringMiner 2.0’

This submission is a report on the reuse of Tsantalis et al.'s Refactoring Miner (RMiner) package by Penta et al.

Documenting evidence of a reuse of ‘what is a feature? a qualitative study of features in industrial software product lines’

We report here the following example of reuse. The original paper is a prior work about features in product lines by Berger et al. The paper "Dimensions of software configuration: on the configuration context in modern software development" by Siegmund et al. reused definitions and theories about configuration features in the original paper.

Documenting evidence of a reuse of ‘“why should I trust you?”: explaining the predictions of any classifier’

We report here the following example of reuse. LIME is a local instance-based explanation generation framework that was originally proposed by Ribeiro et al. in their paper "'Why Should I Trust You?': Explaining the Predictions of Any Classifier". The framework was reused by Peng et al. in their paper "Defect Reduction Planning (using TimeLIME)". The paper used the original implementation of LIME as one of the core components in the proposed framework.

Documenting evidence of a replication of ‘populating a release history database from version control and bug tracking systems’

We report here the use of a keyword-based and regular expression-based approach to identify bug-fixing commits by linking commit messages and issue tracker data in a recent FSE '20 paper by Penta et al. in their paper "On the Relationship between Refactoring Actions and Bugs: A Differentiated Replication". The approach replicated is a keyword-based and regular expression-based approach as studied by Fischer et al.

Documenting evidence of a replication of ‘analyze this! 145 questions for data scientists in software engineering’

We report here the use of the 145 software engineering questions for data scientists presented in the Microsoft study in a recent FSE~'20 paper by Huijgens et al. The study by Begel et al. was replicated by Huijgens et al.

Documenting evidence of a reproduction of ‘is there a “golden” feature set for static warning identification? — an experimental evaluation’

We report here the use of the static analysis dataset generated by FindBugs in a recent EMSE '21 paper by Yang et al. The artifact reproduced is supervised models to perform static analysis based on a golden feature set as studied by Wang et al.

A replication of ‘DeepBugs: a learning approach to name-based bug detection’

We replicated the main result of DeepBugs, a bug detection algorithm for name-based bugs. The original authors evaluated it in three contexts: swapped-argument bugs, wrong binary operator,and wrong binary operator operands. We followed the algorithm and replicated the results for swapped-argument bugs. Our replication used independent implementations of the major components: training set generation, token vectorization, and neural network data pipeline, model, and loss function. Using the same dataset and the same testing process, we report comparable performance: within 2% of the accuracy reported by Pradel and Sen.

SESSION: Doctoral Symposium

Investigating documented information for accurate effort estimation in agile software development

An Agile development team estimates the effort of a work item to plan a sprint (i.e., an iteration in Scrum). Hence, reliable effort estimation would help the team create a reliable sprint plan. Prior studies proposed automated approaches to help the team to estimate the effort accurately. However, the effort estimated by the previously proposed approaches may become inaccurate when the related information is changed. Especially when the estimated effort is changed after the sprint has been planned, the sprint plan may be invalidated and the team might waste their time and effort spent in planning. This thesis aims to help the Agile development team improve the stability of effort estimation (in the Story Points unit) while aligning with the just-in-time practice of Agile. Hence, we first conduct empirical studies using mixed-methods approaches to investigate the potential impact of instability of Story Points. To help an Agile team to achieve reliable effort estimation with optimal effort, we will develop approaches to predict the future Story Points changes and the future information changes to help the team cope with uncertainty when finalizing the sprint plan.

Security guarantees for automated software testing

Before making important decisions about an ongoing fuzzing campaign, we believe an engineer may want to know: (i) the achieved level of confidence about the program’s correctness (residual risk), (ii) the expected increase in confidence about the program’s correctness if we invest more time for the current campaign (cost-benefit trade-off), and (iii) the total number of bugs that the fuzzer can find in the limit (effectiveness). The ability to accurately estimate the above quantities through observed data of the fuzzing campaign allows engineers to make required decisions with quantifiable accuracy. Currently, there are popular data-driven approaches to provide such quantitative guidance on decision making for white- and blackbox fuzzing campaigns. However, none of these prevailing techniques can guarantee unbiased estimation of residual risk, cost-benefit trade-off, or effectiveness for greybox fuzzing – the most popular automated software vulnerability discovery technique to date. Greybox fuzzers introduce an adaptive bias to existing estimators that needs to be corrected during the quantitative analysis to make accurate decisions about the campaign.

In this thesis, our primary objective is to develop a rich statistical framework that supports quantitative decision-making for greybox fuzzing campaigns. We leverage this framework to introduce appropriate bias correction strategies to existing estimators and propose novel estimators that account for adaptive bias in greybox fuzzing.

Unveiling multiple facets of design degradation in modern code review

Software design is a key concern in code review through which developers actively discuss and improve each code change. Nevertheless, code review is predominantly a cooperative task influenced by both technical and social aspects. Consequently, these aspects can play a key role in how software design degrades as well as contributing to accelerating or reversing the degradation during the process of each single code change’s review. However, there is little understanding about such social and technical aspects relates to either the reduction or the increase of design degradation as the project evolves. Consequently, the scarce knowledge on this topic helps little in properly guiding developers along design-driven code reviews. Our goal in this Doctoral research is three-fold: (1) to characterize the impact of code review and their practices on design degradation over time; (2) to understand the contribution of technical and social aspects to design degradation; and (3) to propose a conceptual framework to support design-decision making during code review. Our preliminary results show that the majority of code reviews had little to no design degradation impact, and that technical and social aspects contribute to distinguishing and predicting design impactful changes.

Freeing hybrid distributed AI training configuration

Deep neural network (DNN) has become the leading technology to realize Artificial Intelligence (AI). As DNN models become larger and more complex, so do datasets. Being able to efficiently train DNNs in parallel has become a crucial need. Data Parallelism (DP) is the widest-used solution today to accelerate DNN training but could be inefficient when processing DNNs with large-size parameters. Hybrid Parallelism (HP), which applies different parallel strategies on different parts of DNNs, is more efficient but requires advanced configurations. Not all AI researchers are experts in parallel computing, thus automating the configuration of HP strategies is very desirable for all AI frameworks. We propose a parallel semantics analysis method, which can analyze the trade-offs among different kinds of parallelisms and systematically choose the HP strategies with good training time performance. We demonstrated experimentally 260% speedup when applying our method compared to using a conventional DP approach. With our proposal, AI researchers would be able to focus more on AI algorithm research without being disturbed by parallel analysis and engineering concerns.

Towards an approach for resource-driven adaptation

Resource-driven systems have tasks that are bound by limited resources. These systems must adapt their tasks that cannot gain access to sufficient resources. This dissertation proposes a new resource-driven adaptation approach, which aims to support (1) task prioritisation using multiple criteria such as the time of day that a task should be executed, the role of involved users, and selection of the least costly adaptation types; (2) collaboration between a human user and a software tool for preparing adapted task behaviour to be used when resources are substituted; and (3) resource extensibility and heterogeneity. The proposed approach is being implemented and will be evaluated with scenarios from enterprise applications.

Deployment coordination for cross-functional DevOps teams

Software stability and reliability are the core concerns of DevOps. They are improved by tightening the collaboration between developers and operators in cross-functional teams on the one hand and by automating operations through continuous integration (CI) and infrastructure as code (IaC) on the other hand. Ideally, teams in DevOps are fully independent. Still, their applications often depend on each other in practice, requiring them to coordinate their deployment through centralization or manual coordination.

With this work, we propose and implement the novel IaC solution µs ([mju:z] ”muse”), which automates deployment coordination in a decentralized fashion. µs is the first approach that is compatible with the DevOps goals as it enables truly independent operations of the DevOps teams. We define our research problem through a questionnaire survey with IT professionals and evaluate the solution by comparing it to other modern IaC approaches, assessing its performance, and applying it to existing IaC programs.

Lightweight verification via specialized typecheckers

Testing and other unsound analyses are developer-friendly but cannot give guarantees that programs are free of bugs. Verification and other extant sound approaches can give guarantees but often require too much effort for everyday developers. In this work, we describe our efforts to make verification more accessible for developers by using specialized pluggable typecheckers---a relatively accessible verification technology---to solve complex problems that previously required more complex and harder-to-use verification approaches.

Multi-location cryptographic code repair with neural-network-based methodologies

Java Cryptographic API libraries are error-prone and result in vulnerabilities. The fixes of them often require security expertise and extra consideration for cryptographic consistency at multiple code locations. My Ph.D. research aims to help developers with a multi-location cryptographic code repair system. The proposed method relies on a precise static analysis for cryptographic code and a neural network based secure code generation solution. We focus on designing neural network based techniques guided by program analysis to learn from the secure code and give accurate suggestions. First, we conducted a comprehensive measurement to compare cryptographic API embeddings guided by different program analysis strategies. Then, we identified two previously unreported programming language-specific challenges, differentiating functionally similar APIs and capturing low-frequency code patterns. We address them by a specialized multi-path code suggestion architecture, and a novel low-frequency enhanced sequence learning technique. Existing results show that our approach achieves significant improvements on top-1 accuracy compared with the state-of-the-art.Our next step is an cryptographic consistent localization that enables our multi-location code repair. We publish our data and code as a large Java cryptographic code dataset.

Improving the effectiveness of peer code review in identifying security defects

Prior studies found peer code review useful in identifying security defects. That is why most of the commercial and open-source software (OSS) projects embraced peer code review and mandated the use of it in the software development life cycle. However, despite conducting mandatory peer code review practices, many security-critical OSS projects such as Chromium, Mozilla, and Qt are reporting a high number of post-release vulnerabilities to the Common Vulnerabilities and Exposures (CVE) database. Practitioners may wonder if there is any missing piece in the puzzle that leads code reviews to miss those security defects. Therefore, the primary objective of this dissertation study is to improve the effectiveness of peer code review in identifying security defects.

To meet this goal, I plan to empirically investigate: (i) why security defects escape code reviews, (ii) what are the challenges developers face to conduct effective security code reviews, (iii) how to build effective security code review strategy, and (iv) how to make effective utilization of security experts during code reviews.

Reducing cost in continuous integration with a collection of build selection approaches

Continuous integration (CI) is a widely used practice in modern software engineering. Unfortunately, it is also an expensive practice — Google and Mozilla estimate their CI systems in millions of dollars. To reduce CI computation cost, I propose the strategy of build selection to selectively execute those builds whose outcomes are failing and skip those passing builds for cost-saving. In my research, I firstly designed SmartBuildSkip as my first build selection approach that can skip unfruitful builds in CI automatically. Next, I evaluated SmartBuildSkip with all CI-improving approaches for understanding the strength and weakness of existing approaches to recommend future technique design. Then I proposed PreciseBuildSkip as a build selection approach to maximize the safety of skipping builds in CI. I also combined existing approaches both within and across granularity to be applied as a new build selection approach — HybridBuildSkip to save builds in a hybrid way. Finally, I plan to propose a human study to understand how to increase developers' trust on build selection approaches.

A live environment for inspection and refactoring of software systems

Refactoring helps to improve the design of software systems, making them more readable, maintainable, cleaner, and easy to expand. Most of the tools that already exist on this concept allow developers to select and execute the best refactoring techniques for a particular programming context. However, they aren’t interactive and prompt enough, providing a poor programming experience. In this gap, we can introduce and combine the topic of liveness with refactoring methods. Live Refactoring allows to know continuously, while programming, the blocks of code that we should refactor and why they were classified as problematic. Therefore, it shortens the time needed to create high-quality systems, due to early and continuous refactoring feedback, support, and guidance. This paper presents our research project based on a live refactoring environment. This environment is focused on a refactoring tool that aims to explore the concept of Live Refactoring and its main components --- recommendation, visualization, and application.

SESSION: Undergraduate Students

PorkFuzz: testing stateful software-defined network applications with property graphs

This paper proposes a state-aware fuzzing framework for testing software-defined network applications. It leverages a property graph to store fuzzing results. Application developers can easily express oracles with the graph query language to test their applications. The graph representation also allows the oracles to analyze the fuzzing result efficiently.

A qualitative study of cleaning in Jupyter notebooks

Data scientists commonly use computational notebooks because they provide a good environment for testing multiple models. However, once the scientist completes the code and finds the ideal model, the data scientist will have to dedicate time to clean up the code in order for others to understand it. In this paper, we perform a qualitative study on how scientists clean their code in hopes of being able to suggest a tool to automate this process. Our end goal is for tool builders to address possible gaps and provide additional aid to data scientists, who can then focus more on their actual work rather than the routine and tedious cleaning duties.

Automated generation of realistic test inputs for web APIs

Testing web APIs automatically requires generating input data values such as addressess, coordinates or country codes. Generating meaningful values for these types of parameters randomly is rarely feasible, which means a major obstacle for current test case generation approaches. In this paper, we present ARTE, the first semantic-based approach for the Automated generation of Realistic TEst inputs for web APIs. Specifically, ARTE leverages the specification of the API under test to search for meaningful test inputs for the API parameters in knowledge bases like DBpedia. Our approach has been integrated into RESTest, a state-of-the-art tool for API testing, achieving an unprecedented level of automation which allows to generate up to 100% more valid API calls than existing fuzzing techniques, 30% on average. Evaluation results on a set of 26 real-world APIs show that ARTE can generate realistic inputs for 7 out of every 10 parameters, outperforming related approaches.

Contextualizing toxicity in open source: a qualitative study

In this paper, we study toxic online interactions in issue discussions of open-source communities. Our goal is to qualitatively understand how toxicity impacts an open-source community like GitHub. We are driven by users complaining about toxicity, which leads to burnout and disengagement from the site. We collect a substantial sample of toxic interactions and qualitatively analyze their characteristics to ground future discussions and intervention design.

Accelerating redundancy-based program repair via code representation learning and adaptive patch filtering

Automated program repair (APR) has attracted extensive attention and many APR techniques have been proposed recently, in which redundancy-based techniques have achieved great success. However, they still suffer from the efficiency issue mainly caused by the inaccuracy of measuring code similarity, which may produce meaningless patches that hinder the generation and validation of correct patches. To solve this issue, we propose a novel method AccPR, which leverages code representation to measure code similarity and employs adaptive patch filtering to accelerate redundancy-based APR. We have implemented a prototype of AccPR and integrated it with a SOTA APR tool, SimFix, where the average improvement of efficiency is 47.85%, indicating AccPR is promising.

SMT solver testing with type and grammar based mutation

Satisfiability Modulo Theories (SMT) solvers are at the core of many software advances such as program analysis and verification which are highly safety-critical. Hence, to ensure the correctness of the solvers, there have been multiple fuzzing campaigns targeting different logics since 2009. In this paper, we propose a generative type-aware mutation strategy, which is a generalization of a type-aware operator mutation. We have realized the generative type-aware mutation and reported 158 bugs in Z3 and CVC4 including bugs from the versions released as early as 2016 in five months.

SESSION: Graduate Students

Overcoming metric diversity in meta-analysis for software engineering: proposed approach and a case study on its usage on the effects of software reuse

This work addresses the problem of metric diversity in meta-analysis for Software Engineering by clustering studies using input-output tables and by vote-counting. Diversity arises when researchers, measuring same phenomena, use different, and typically "incomparable" metrics making impossible a direct analysis of the effects and their sizes. Additionally, this work discusses an application of proposed approach to the case of Software Reuse.

A general approach to modeling Java framework behaviors

Interprocedural static analysis tools such as security analyses need good call graphs, which are challenging to scale for framework-based applications. So most tools model rather than analyzing frameworks. These models are manually crafted to capture framework semantics crucial for the particular analysis, and are inherently incomplete. We propose a general approach to modeling Java frameworks. It is not limited to any framework or analysis tool, therefore, highly reusable. While a generic approximation can be noisy, we show our carefully-constructed one does well. Experiments on Android with a client taint analysis show that our approach produces more complete call graphs than the original analysis. As a result, the client analysis works better: both precision (from 0.83 to 0.86) and recall (from 0.20 to 0.31) are improved.

Discovering repetitive code changes in ML systems

Similar to software evolution in other software systems, ML software systems evolve with many repetitive changes. Despite some research and tooling for repetitive code changes that exist in Java and other languages, there is a lack of such tools for Python. Given the significant rise of ML software development, and that many ML developers are not professionally trained developers, the lack of software evolution tools for ML code is even more critical. To bring the ML developers’ toolset into the 21st century, we implemented an approach to adapt and reuse the vast ecosystem of Java static analysis tools for Python. Using this approach, we adapted two software evolution tools, RefactoringMiner and CPATMiner, to Python. With the tools, we conducted the first and most fine-grained study on code change patterns in 59 ML systems and surveyed 253 developers. We recommend empirically-justified, actionable opportunities for tool builders and release the tools for researchers.

Does reusing pre-trained NLP model propagate bugs?

In this digital era, the textual content has become a seemingly ubiquitous part of our life. Natural Language Processing (NLP) empowers machines to comprehend the intricacies of textual data and eases human-computer interaction. Advancement in language modeling, continual learning, availability of a large amount of linguistic data, and large-scale computational power have made it feasible to train models for downstream tasks related to text analysis, including safety-critical ones, e.g., medical, airlines, etc. Compared to other deep learning (DL) models, NLP-based models are widely reused for various tasks. However, the reuse of pre-trained models in a new setting is still a complex task due to the limitations of the training dataset, model structure, specification, usage, etc. With this motivation, we study BERT, a vastly used language model (LM), from the direction of reusing in the code. We mined 80 posts from Stack Overflow related to BERT and found 4 types of bugs observed in clients’ code. Our results show that 13.75% are fairness, 28.75% are parameter, 15% are token, and 16.25% are version-related bugs.

Mitigating security attacks in kubernetes manifests for security best practices violation

Kubernetes is an open-source software system that helps practitioners in automatically deploying, scaling, and managing containerized applications. Information technology (IT) organizations, such as IBM, Spotify, and Capital One, use Kubernetes to manage their containers and reported benefits in the deployment process. However, recent security breaches and survey results among practitioners suggest that Kubernetes deployment can be vulnerable to attacks due to misconfiguration and not following security best practices. This research explores how malicious users can perform potential security exploits from the violations of Kubernetes security best practices. We explore how attacks can be conducted such as denial of service attacks against one of the security best practices violations in Kubernetes manifests. In addition, we are exploring potential exploits in the Kubernetes cluster to propose mitigation strategies to secure the Kubernetes cluster.