ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Full Citation in the ACM Digital Library

SESSION: Keynotes

Living with feature interactions (keynote)

Feature-oriented software development enables rapid software creation and evolution, through incremental and parallel feature development or through product line engineering. However, in practice, features are often not separate concerns. They behave differently in the presence of other features, and they sometimes interfere with each other in surprising ways.

This talk will explore challenges in feature interactions and their resolutions. Resolution strategies can tackle large classes of interactions, but are imperfect and incomplete, leading to research opportunities in software architecture, composition semantics, and verification.

Safety and robustness for deep learning with provable guarantees (keynote)

Computing systems are becoming ever more complex, with decisions increasingly often based on deep learning components. A wide variety of applications are being developed, many of them safety-critical, such as self-driving cars and medical diagnosis. Since deep learning is unstable with respect to adversarial perturbations, there is a need for rigorous software development methodologies that encompass machine learning components. This lecture will describe progress with developing automated verification and testing techniques for deep neural networks to ensure safety and robustness of their decisions with respect to input perturbations. The techniques exploit Lipschitz continuity of the networks and aim to approximate, for a given set of inputs, the reachable set of network outputs in terms of lower and upper bounds, in anytime manner, with provable guarantees. We develop novel algorithms based on feature-guided search, games, global optimisation and Bayesian methods, and evaluate them on state-of-the-art networks. The lecture will conclude with an overview of the challenges in this field.

Insights from open source software supply chains (keynote)

Open Source Software (OSS) forms an infrastructure on which numerous (often critical) software applications are based. Substantial research was done to investigate central projects such as Linux kernel but we have only a limited understanding of how the periphery of the larger OSS ecosystem is interconnected through technical dependencies, code sharing, and knowledge flows. We aim to close this gap by a) creating a nearly complete and rapidly updateable collection of version control data for FLOSS projects; b) by cleaning, correcting, and augmenting the data to measure several types of dependencies among code, developers, and projects; c) by creating models that rely on the resulting supply chains to investigate structural and dynamic properties of the entire OSS. The current implementation is capable of being updated each month, occupies over 300Tb of disk space with 1.5B commits and 12B git objects. Highly accurate algorithms to correct identity data and extract dependencies from the source code are used to characterize the current structure of OSS and the way it has evolved. In particular, models of technology spread demonstrate the implicit factors developers use when choosing software components. We expect the resulting research platform will both spur investigations on how the huge periphery in OSS both sustains and is sustained by the central OSS projects and, as a result, will increase resiliency and effectiveness of the OSS.

SESSION: Main Research

Concolic testing for models of state-based systems

Testing models of modern cyber-physical systems is not straightforward due to timing constraints, numerous if not infinite possible behaviors, and complex communications between components. Software testing tools and approaches that can generate test cases to test these systems are therefore important. Many of the existing automatic approaches support testing at the implementation level only. The existing model-level testing tools either treat the model as a black box (e.g., random testing approaches) or have limitations when it comes to generating complex test sequences (e.g., symbolic execution). This paper presents a novel approach and tool support for automatic unit testing of models of real-time embedded systems by conducting concolic testing, a hybrid testing technique based on concrete and symbolic execution. Our technique conducts automatic concolic testing in two phases. In the first phase, model is isolated from its environment, is transformed to a testable model and is integrated with a test harness. In the second phase, the harness tests the model concolically and reports the test execution results. We describe an implementation of our approach in the context of Papyrus-RT, an open source Model Driven Engineering (MDE) tool based on the modeling language UML-RT, and report the results of applying our concolic testing approach to a set of standard benchmark models to validate our approach.

Target-driven compositional concolic testing with function summary refinement for effective bug detection

Concolic testing is popular in unit testing because it can detect bugs quickly in a relatively small search space. But, in system-level testing, it suffers from the symbolic path explosion and often misses bugs. To resolve this problem, we have developed a focused compositional concolic testing technique, FOCAL, for effective bug detection. Focusing on a target unit failure v (a crash or an assert violation) detected by concolic unit testing, FOCAL generates a system-level test input that validates v. This test input is obtained by building and solving symbolic path formulas that represent system-level executions raising v. FOCAL builds such formulas by combining function summaries one by one backward from a function that raised v to main. If a function summary φa of function a conflicts with the summaries of the other functions, FOCAL refines φa to φa′ by applying a refining constraint learned from the conflict. FOCAL showed high system-level bug detection ability by detecting 71 out of the 100 real-world target bugs in the SIR benchmark, while other relevant cutting edge techniques (i.e., AFL-fast, KATCH, Mix-CCBSE) detected at most 40 bugs. Also, FOCAL detected 13 new crash bugs in popular file parsing programs.

Generating automated and online test oracles for Simulink models with continuous and uncertain behaviors

Test automation requires automated oracles to assess test outputs. For cyber physical systems (CPS), oracles, in addition to be automated, should ensure some key objectives: (i) they should check test outputs in an online manner to stop expensive test executions as soon as a failure is detected; (ii) they should handle time- and magnitude-continuous CPS behaviors; (iii) they should provide a quantitative degree of satisfaction or failure measure instead of binary pass/fail outputs; and (iv) they should be able to handle uncertainties due to CPS interactions with the environment. We propose an automated approach to translate CPS requirements specified in a logic-based language into test oracles specified in Simulink - a widely-used development and simulation language for CPS. Our approach achieves the objectives noted above through the identification of a fragment of Signal First Order logic (SFOL) to specify requirements, the definition of a quantitative semantics for this fragment and a sound translation of the fragment into Simulink. The results from applying our approach on 11 industrial case studies show that: (i) our requirements language can express all the 98 requirements of our case studies; (ii) the time and effort required by our approach are acceptable, showing potentials for the adoption of our work in practice, and (iii) for large models, our approach can dramatically reduce the test execution time compared to when test outputs are checked in an offline manner.

Lifting Datalog-based analyses to software product lines

Applying program analyses to Software Product Lines (SPLs) has been a fundamental research problem at the intersection of Product Line Engineering and software analysis. Different attempts have been made to ”lift” particular product-level analyses to run on the entire product line. In this paper, we tackle the class of Datalog-based analyses (e.g., pointer and taint analyses), study the theoretical aspects of lifting Datalog inference, and implement a lifted inference algorithm inside the Soufflé Datalog engine. We evaluate our implementation on a set of benchmark product lines. We show significant savings in processing time and fact database size (billions of times faster on one of the benchmarks) compared to brute-force analysis of each product individually.

An empirical study of real-world variability bugs detected by variability-oblivious tools

Many critical software systems developed in C utilize compile-time configurability. The many possible configurations of this software make bug detection through static analysis difficult. While variability-aware static analyses have been developed, there remains a gap between those and state-of-the-art static bug detection tools. In order to collect data on how such tools may perform and to develop real-world benchmarks, we present a way to leverage configuration sampling, off-the-shelf “variability-oblivious” bug detectors, and automatic feature identification techniques to simulate a variability-aware analysis. We instantiate our approach using four popular static analysis tools on three highly configurable, real-world C projects, obtaining 36,061 warnings, 80% of which are variability warnings. We analyze the warnings we collect from these experiments, finding that most results are variability warnings of a variety of kinds such as NULL dereference. We then manually investigate these warnings to produce a benchmark of 77 confirmed true bugs (52 of which are variability bugs) useful for future development of variability-aware analyses.

Principles of feature modeling

Feature models are arguably one of the most intuitive and successful notations for modeling the features of a variant-rich software system. Feature models help developers to keep an overall understanding of the system, and also support scoping, planning, development, variant derivation, configuration, and maintenance activities that sustain the system's long-term success. Unfortunately, feature models are difficult to build and evolve. Features need to be identified, grouped, organized in a hierarchy, and mapped to software assets. Also, dependencies between features need to be declared. While feature models have been the subject of three decades of research, resulting in many feature-modeling notations together with automated analysis and configuration techniques, a generic set of principles for engineering feature models is still missing. It is not even clear whether feature models could be engineered using recurrent principles. Our work shows that such principles in fact exist. We analyzed feature-modeling practices elicited from ten interviews conducted with industrial practitioners and from 31 relevant papers. We synthesized a set of 34 principles covering eight different phases of feature modeling, from planning over model construction, to model maintenance and evolution. Grounded in empirical evidence, these principles provide practical, context-specific advice on how to perform feature modeling, describe what information sources to consider, and highlight common characteristics of feature models. We believe that our principles can support researchers and practitioners enhancing feature-modeling tooling, synthesis, and analyses techniques, as well as scope future research.

Understanding GCC builtins to develop better tools

C programs can use compiler builtins to provide functionality that the C language lacks. On Linux, GCC provides several thousands of builtins that are also supported by other mature compilers, such as Clang and ICC. Maintainers of other tools lack guidance on whether and which builtins should be implemented to support popular projects. To assist tool developers who want to support GCC builtins, we analyzed builtin use in 4,913 C projects from GitHub. We found that 37% of these projects relied on at least one builtin. Supporting an increasing proportion of projects requires support of an exponentially increasing number of builtins; however, implementing only 10 builtins already covers over 30% of the projects. Since we found that many builtins in our corpus remained unused, the effort needed to support 90% of the projects is moderate, requiring about 110 builtins to be implemented. For each project, we analyzed the evolution of builtin use over time and found that the majority of projects mostly added builtins. This suggests that builtins are not a legacy feature and must be supported in future tools. Systematic testing of builtin support in existing tools revealed that many lacked support for builtins either partially or completely; we also discovered incorrect implementations in various tools, including the formally verified CompCert compiler.

Assessing the quality of the steps to reproduce in bug reports

A major problem with user-written bug reports, indicated by developers and documented by researchers, is the (lack of high) quality of the reported steps to reproduce the bugs. Low-quality steps to reproduce lead to excessive manual effort spent on bug triage and resolution. This paper proposes Euler, an approach that automatically identifies and assesses the quality of the steps to reproduce in a bug report, providing feedback to the reporters, which they can use to improve the bug report. The feedback provided by Euler was assessed by external evaluators and the results indicate that Euler correctly identified 98% of the existing steps to reproduce and 58% of the missing ones, while 73% of its quality annotations are correct.

A learning-based approach for automatic construction of domain glossary from source code and documentation

A domain glossary that organizes domain-specific concepts and their aliases and relations is essential for knowledge acquisition and software development. Existing approaches use linguistic heuristics or term-frequency-based statistics to identify domain specific terms from software documentation, and thus the accuracy is often low. In this paper, we propose a learning-based approach for automatic construction of domain glossary from source code and software documentation. The approach uses a set of high-quality seed terms identified from code identifiers and natural language concept definitions to train a domain-specific prediction model to recognize glossary terms based on the lexical and semantic context of the sentences mentioning domain-specific concepts. It then merges the aliases of the same concepts to their canonical names, selects a set of explanation sentences for each concept, and identifies "is a", "has a", and "related to" relations between the concepts. We apply our approach to deep learning domain and Hadoop domain and harvest 5,382 and 2,069 concepts together with 16,962 and 6,815 relations respectively. Our evaluation validates the accuracy of the extracted domain glossary and its usefulness for the fusion and acquisition of knowledge from different documents of different projects.

On using machine learning to identify knowledge in API reference documentation

Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) with deep learning approaches trained on manually-annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers) the best AUPRC was up to 87

Generating query-specific class API summaries

Source code summaries are concise representations, in form of text and/or code, of complex code elements and are meant to help developers gain a quick understanding that in turns help them perform specific tasks. Generation of summaries that are task-specific is still a challenge in the automatic code summarization field. We propose an approach for generating on-demand, extrinsic hybrid summaries for API classes, relevant to a programming task, formulated as a natural language query. The summaries include the most relevant sentences extracted from the API reference documentation and the most relevant methods.

External evaluators assessed the summaries generated for classes retrieved from JDK and Android libraries for several programming tasks. The majority found that the summaries are complete, concise, and readable. A comparison with summaries produce by three baseline approaches revealed that the information present only in our summaries is more relevant than the one present only in the baselines summaries. Finally, an extrinsic evaluation study showed that the summaries help the users evaluating the correctness of API retrieval results, faster and more accurately.

Semantic relation based expansion of abbreviations

Identifiers account for 70% of source code in terms of characters, and thus the quality of such identifiers is critical for program comprehension and software maintenance. For various reasons, however, many identifiers contain abbreviations, which reduces the readability and maintainability of source code. To this end, a number of approaches have been proposed to expand abbreviations in identifiers. However, such approaches are either inaccurate or confined to specific identifiers. To this end, in this paper we propose a generic and accurate approach to expand identifier abbreviations. The key insight of the approach is that abbreviations in the name of software entity e have great chance to find their full terms in names of software entities that are semantically related to e. Consequently, the proposed approach builds a knowledge graph to represent such entities and their relationships with e, and searches the graph for full terms. The optimal searching strategy for the graph could be learned automatically from a corpus of manually expanded abbreviations. We evaluate the proposed approach on nine well known open-source projects. Results of our k-fold evaluation suggest that the proposed approach improves the state of the art. It improves precision significantly from 29% to 85%, and recall from 29% to 77%. Evaluation results also suggest that the proposed generic approach is even better than the state-of-the-art parameter-specific approach in expanding parameter abbreviations, improving F1 score significantly from 75% to 87%.

Diversity-based web test generation

Existing web test generators derive test paths from a navigational model of the web application, completed with either manually or randomly generated input values. However, manual test data selection is costly, while random generation often results in infeasible input sequences, which are rejected by the application under test. Random and search-based generation can achieve the desired level of model coverage only after a large number of test execution at- tempts, each slowed down by the need to interact with the browser during test execution. In this work, we present a novel web test generation algorithm that pre-selects the most promising candidate test cases based on their diversity from previously generated tests. As such, only the test cases that explore diverse behaviours of the application are considered for in-browser execution. We have implemented our approach in a tool called DIG. Our empirical evaluation on six real-world web applications shows that DIG achieves higher coverage and fault detection rates significantly earlier than crawling-based and search-based web test generators.

Web test dependency detection

E2E web test suites are prone to test dependencies due to the heterogeneous multi-tiered nature of modern web apps, which makes it difficult for developers to create isolated program states for each test case. In this paper, we present the first approach for detecting and validating test dependencies present in E2E web test suites. Our approach employs string analysis to extract an approximated set of dependencies from the test code. It then filters potential false dependencies through natural language processing of test names. Finally, it validates all dependencies, and uses a novel recovery algorithm to ensure no true dependencies are missed in the final test dependency graph. Our approach is implemented in a tool called TEDD and evaluated on the test suites of six open-source web apps. Our results show that TEDD can correctly detect and validate test dependencies up to 72% faster than the baseline with the original test ordering in which the graph contains all possible dependencies. The test dependency graphs produced by TEDD enable test execution parallelization, with a speed-up factor of up to 7×.

Testing scratch programs automatically

Block-based programming environments like Scratch foster engagement with computer programming and are used by millions of young learners. Scratch allows learners to quickly create entertaining programs and games, while eliminating syntactical program errors that could interfere with progress. However, functional programming errors may still lead to incorrect programs, and learners and their teachers need to identify and understand these errors. This is currently an entirely manual process. In this paper, we introduce a formal testing framework that describes the problem of Scratch testing in detail. We instantiate this formal framework with the Whisker tool, which provides automated and property-based testing functionality for Scratch programs. Empirical evaluation on real student and teacher programs demonstrates that Whisker can successfully test Scratch programs, and automatically achieves an average of 95.25 % code coverage. Although well-known testing problems such as test flakiness also exist in the scenario of Scratch testing, we show that automated and property-based testing can accurately reproduce and replace the manually and laboriously produced grading efforts of a teacher, and opens up new possibilities to support learners of programming in their struggles.

A large-scale empirical study of compiler errors in continuous integration

Continuous Integration (CI) is a widely-used software development practice to reduce risks. CI builds often break, and a large amount of efforts are put into troubleshooting broken builds. Despite that compiler errors have been recognized as one of the most frequent types of build failures, little is known about the common types, fix efforts and fix patterns of compiler errors that occur in CI builds of open-source projects. To fill such a gap, we present a large-scale empirical study on 6,854,271 CI builds from 3,799 open-source Java projects hosted on GitHub. Using the build data, we measured the frequency of broken builds caused by compiler errors, investigated the ten most common compiler error types, and reported their fix time. We manually analyzed 325 broken builds to summarize fix patterns of the ten most common compiler error types. Our findings help to characterize and understand compiler errors during CI and provide practical implications to developers, tool builders and researchers.

A statistics-based performance testing methodology for cloud applications

The low cost of resource ownership and flexibility have led users to increasingly port their applications to the clouds. To fully realize the cost benefits of cloud services, users usually need to reliably know the execution performance of their applications. However, due to the random performance fluctuations experienced by cloud applications, the black box nature of public clouds and the cloud usage costs, testing on clouds to acquire accurate performance results is extremely difficult. In this paper, we present a novel cloud performance testing methodology called PT4Cloud. By employing non-parametric statistical approaches of likelihood theory and the bootstrap method, PT4Cloud provides reliable stop conditions to obtain highly accurate performance distributions with confidence bands. These statistical approaches also allow users to specify intuitive accuracy goals and easily trade between accuracy and testing cost. We evaluated PT4Cloud with 33 benchmark configurations on Amazon Web Service and Chameleon clouds. When compared with performance data obtained from extensive performance tests, PT4Cloud provides testing results with 95.4% accuracy on average while reducing the number of test runs by 62%. We also propose two test execution reduction techniques for PT4Cloud, which can reduce the number of test runs by 90.1% while retaining an average accuracy of 91%. We compared our technique to three other techniques and found that our results are much more accurate.

How bad can a bug get? an empirical analysis of software failures in the OpenStack cloud computing platform

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which call for more thorough run-time checks and fault containment.

Towards more efficient meta-heuristic algorithms for combinatorial test generation

Combinatorial interaction testing (CIT) is a popular approach to detecting faults in highly configurable software systems. The core task of CIT is to generate a small test suite called a t-way covering array (CA), where t is the covering strength. Many meta-heuristic algorithms have been proposed to solve the constrained covering array generating (CCAG) problem. A major drawback of existing algorithms is that they usually need considerable time to obtain a good-quality solution, which hinders the wider applications of such algorithms. We observe that the high time consumption of existing meta-heuristic algorithms for CCAG is mainly due to the procedure of score computation. In this work, we propose a much more efficient method for score computation. The score computation method is applied to a state-of-the-art algorithm TCA, showing significant improvements. The new score computation method opens a way to utilize algorithmic ideas relying on scores which were not affordable previously. We integrate a gradient descent search step to further improve the algorithm, leading to a new algorithm called FastCA. Experiments on a broad range of real-world benchmarks and synthetic benchmarks show that, FastCA significantly outperforms state-of-the-art algorithms for CCAG algorithms, in terms of both the size of obtained covering array and the run time.

Compiler bug isolation via effective witness test program generation

Compiler bugs are extremely harmful, but are notoriously difficult to debug because compiler bugs usually produce few debugging information. Given a bug-triggering test program for a compiler, hundreds of compiler files are usually involved during compilation, and thus are suspect buggy files. Although there are lots of automated bug isolation techniques, they are not applicable to compilers due to the scalability or effectiveness problem. To solve this problem, in this paper, we transform the compiler bug isolation problem into a search problem, i.e., searching for a set of effective witness test programs that are able to eliminate innocent compiler files from suspects. Based on this intuition, we propose an automated compiler bug isolation technique, DiWi, which (1) proposes a heuristic-based search strategy to generate such a set of effective witness test programs via applying our designed witnessing mutation rules to the given failing test program, and (2) compares their coverage to isolate bugs following the practice of spectrum-based bug isolation. The experimental results on 90 real bugs from popular GCC and LLVM compilers show that DiWi effectively isolates 66.67%/78.89% bugs within Top-10/Top-20 compiler files, significantly outperforming state-of-the-art bug isolation techniques.

Concolic testing with adaptively changing search heuristics

We present Chameleon, a new approach for adaptively changing search heuristics during concolic testing. Search heuristics play a central role in concolic testing as they mitigate the path-explosion problem by focusing on particular program paths that are likely to increase code coverage as quickly as possible. A variety of techniques for search heuristics have been proposed over the past decade. However, existing approaches are limited in that they use the same search heuristics throughout the entire testing process, which is inherently insufficient to exercise various execution paths. Chameleon overcomes this limitation by adapting search heuristics on the fly via an algorithm that learns new search heuristics based on the knowledge accumulated during concolic testing. Experimental results show that the transition from the traditional non-adaptive approaches to ours greatly improves the practicality of concolic testing in terms of both code coverage and bug-finding.

Symbolic execution-driven extraction of the parallel execution plans of Spark applications

The execution of Spark applications is based on the execution order and parallelism of the different jobs, given data and available resources. Spark reifies these dependencies in a graph that we refer to as the (parallel) execution plan of the application. All the approaches that have studied the estimation of the execution times and the dynamic provisioning of resources for this kind of applications have always assumed that the execution plan is unique, given the computing resources at hand. This assumption is at least simplistic for applications that include conditional branches or loops and limits the precision of the prediction techniques.

This paper introduces SEEPEP, a novel technique based on symbolic execution and search-based test generation, that: i) automatically extracts the possible execution plans of a Spark application, along with dedicated launchers with properly synthesized data that can be used for profiling, and ii) tunes the allocation of resources at runtime based on the knowledge of the execution plans for which the path conditions hold. The assessment we carried out shows that SEEPEP can effectively complement dynaSpark, an extension of Spark with dynamic resource provisioning capabilities, to help predict the execution duration and the allocation of resources.

Generating effective test cases for self-driving cars from police reports

Autonomous driving carries the promise to drastically reduce the number of car accidents; however, recently reported fatal crashes involving self-driving cars show that such an important goal is not yet achieved. This calls for better testing of the software controlling self-driving cars, which is difficult because it requires producing challenging driving scenarios. To better test self-driving car soft- ware, we propose to specifically test car crash scenarios, which are critical par excellence. Since real car crashes are difficult to test in field operation, we recreate them as physically accurate simulations in an environment that can be used for testing self-driving car software. To cope with the scarcity of sensory data collected during real car crashes which does not enable a full reproduction, we extract the information to recreate real car crashes from the police reports which document them. Our extensive evaluation, consisting of a user study involving 34 participants and a quantitative analysis of the quality of the generated tests, shows that we can generate accurate simulations of car crashes in a matter of minutes. Compared to tests which implement non critical driving scenarios, our tests effectively stressed the test subject in different ways and exposed several shortcomings in its implementation.

Preference-wise testing for Android applications

Preferences, the setting options provided by Android, are an essential part of Android apps. Preferences allow users to change app features and behaviors dynamically, and therefore, need to be thoroughly tested. Unfortunately, the specific preferences used in test cases are typically not explicitly specified, forcing testers to manually set options or blindly try different option combinations. To effectively test the impacts of different preference options, this paper presents PREFEST, as a preference-wise enhanced automatic testing approach, for Android apps. Given a set of test cases, PREFEST can locate the preferences that may affect the test cases with a static and dynamic combined analysis on the app under test, and execute these test cases only under necessary option combinations. The evaluation shows that PREFEST can improve 6.8% code coverage and 12.3% branch coverage and find five more real bugs compared to testing with the original test cases. The test cost is reduced by 99% for both the number of test cases and the testing time, compared to testing under pairwise combination of options.

Bisecting commits and modeling commit risk during testing

Software testing is one of the costliest stages in the software development life cycle. One approach to reducing the test execution cost is to group changes and test them as a batch (i.e. batch testing). However, when tests fail in a batch, commits in the batch need to be re-tested to identify the cause of the failure, i.e. the culprit commit. The re-testing is typically done through bisection (i.e. a binary search through the commits in a batch). Intuitively, the effectiveness of batch testing highly depends on the size of the batch. Larger batches require fewer initial test runs, but have a higher chance of a test failure that can lead to expensive test re-runs to find the culprit. We are unaware of research that investigates and simulates the impact of batch sizes on the cost of testing in industry. In this work, we first conduct empirical studies on the effectiveness of batch testing in three large-scale industrial software systems at Ericsson. Using 9 months of testing data, we simulate batch sizes from 1 to 20 and find the most cost-effective BatchSize for each project. Our results show that batch testing saves 72% of test executions compared to testing each commit individually. In a second simulation, we incorporate flaky tests that pass and fail on the same commit as they are a significant source of additional test executions on large projects. We model the degree of flakiness for each project and find that test flakiness reduces the cost savings to 42%. In a third simulation, we guide bisection to reduce the likelihood of batch-testing failures. We model the riskiness of each commit in a batch using a bug model and a test execution history model. The risky commits are tested individually, while the less risky commits are tested in a single larger batch. Culprit predictions with our approach reduce test executions up to 9% compared to Ericsson’s current bisection approach. The results have been adopted by developers at Ericsson and a tool to guide bisection is in the process of being added to Ericsson’s continuous integration pipeline.

White-box testing of big data analytics with complex user-defined functions

Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are being leveraged to process massive quantities of data in the cloud. Modern DISC applications pose new challenges in exhaustive, automatic testing because they consist of dataflow operators, and complex user-defined functions (UDF) are prevalent unlike SQL queries. We design a new white-box testing approach, called BigTest to reason about the internal semantics of UDFs in tandem with the equivalence classes created by each dataflow and relational operator. Our evaluation shows that, despite ultra-large scale input data size, real world DISC applications are often significantly skewed and inadequate in terms of test coverage, leaving 34% of Joint Dataflow and UDF (JDU) paths untested. BigTest shows the potential to minimize data size for local testing by 10^5 to 10^8 orders of magnitude while revealing 2X more manually-injected faults than the previous approach. Our experiment shows that only few of the data records (order of tens) are actually required to achieve the same JDU coverage as the entire production data. The reduction in test data also provides CPU time saving of 194X on average, demonstrating that interactive and fast local testing is feasible for big data analytics, obviating the need to test applications on huge production data.

Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts

In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 2,141 bugs from 5 benchmarks. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, by generating patches for 47% of the bugs from Defects4J compared to 10-30% of bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts, which we used to find causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers to improve their approaches and tools.

iFixR: bug report driven program repair

Issue tracking systems are commonly used in modern software development for collecting feedback from users and developers. An ultimate automation target of software maintenance is then the systematization of patch generation for user-reported bugs. Although this ambition is aligned with the momentum of automated program repair, the literature has, so far, mostly focused on generate-and- validate setups where fault localization and patch generation are driven by a well-defined test suite. On the one hand, however, the common (yet strong) assumption on the existence of relevant test cases does not hold in practice for most development settings: many bugs are reported without the available test suite being able to reveal them. On the other hand, for many projects, the number of bug reports generally outstrips the resources available to triage them. Towards increasing the adoption of patch generation tools by practitioners, we investigate a new repair pipeline, iFixR, driven by bug reports: (1) bug reports are fed to an IR-based fault localizer; (2) patches are generated from fix patterns and validated via regression testing; (3) a prioritized list of generated patches is proposed to developers. We evaluate iFixR on the Defects4J dataset, which we enriched (i.e., faults are linked to bug reports) and carefully-reorganized (i.e., the timeline of test-cases is naturally split). iFixR generates genuine/plausible patches for 21/44 Defects4J faults with its IR-based fault localizer. iFixR accurately places a genuine/plausible patch among its top-5 recommendation for 8/13 of these faults (without using future test cases in generation-and-validation).

Exploring and exploiting the correlations between bug-inducing and bug-fixing commits

Bug-inducing commits provide important information to understand when and how bugs were introduced. Therefore, they have been extensively investigated by existing studies and frequently leveraged to facilitate bug fixings in industrial practices.

Due to the importance of bug-inducing commits in software debugging, we are motivated to conduct the first systematic empirical study to explore the correlations between bug-inducing and bug-fixing commits in terms of code elements and modifications. To facilitate the study, we collected the inducing and fixing commits for 333 bugs from seven large open-source projects. The empirical findings reveal important and significant correlations between a bug's inducing and fixing commits. We further exploit the usefulness of such correlation findings from two aspects. First, they explain why the SZZ algorithm, the most widely-adopted approach to collecting bug-inducing commits, is imprecise. In view of SZZ's imprecision, we revisited the findings of previous studies based on SZZ, and found that 8 out of 10 previous findings are significantly affected by SZZ's imprecision. Second, they shed lights on the design of automated debugging techniques. For demonstration, we designed approaches that exploit the correlations with respect to statements and change actions. Our experiments on Defects4J show that our approaches can boost the performance of fault localization significantly and also advance existing APR techniques.

Effects of explicit feature traceability on program comprehension

Developers spend a substantial amount of their time with program comprehension. To improve their comprehension and refresh their memory, developers need to communicate with other developers, read the documentation, and analyze the source code. Many studies show that developers focus primarily on the source code and that small improvements can have a strong impact. As such, it is crucial to bring the code itself into a more comprehensible form. A particular technique for this purpose are explicit feature traces to easily identify a program’s functionalities. To improve our empirical understanding about the effects of feature traces, we report an online experiment with 49 professional software developers. We studied the impact of explicit feature traces, namely annotations and decomposition, on program comprehension and compared them to the same code without traces. Besides this experiment, we also asked our participants about their opinions in order to combine quantitative and qualitative data. Our results indicate that, as opposed to purely object-oriented code: (1) annotations can have positive effects on program comprehension; (2) decomposition can have a negative impact on bug localization; and (3) our participants perceive both techniques as beneficial. Moreover, none of the three code versions yields significant improvements on task completion time. Overall, our results indicate that lightweight traceability, such as using annotations, provides immediate benefits to developers during software development and maintenance without extensive training or tooling; and can improve current industrial practices that rely on heavyweight traceability tools (e.g., DOORS) and retroactive fulfillment of standards (e.g., ISO-26262, DO-178B).

What the fork: a study of inefficient and efficient forking practices in social coding

Forking and pull requests have been widely used in open-source communities as a uniform development and contribution mechanism, giving developers the flexibility to modify their own fork without affecting others before attempting to contribute back. However, not all projects use forks efficiently; many experience lost and duplicate contributions and fragmented communities. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. First, we observed that different communities experience these inefficiencies to widely different degrees and interviewed practitioners to understand why. Then, using multiple regression modeling, we analyzed which context factors correlate with fewer inefficiencies.We found that better modularity and centralized management are associated with more contributions and a higher fraction of accepted pull requests, suggesting specific best practices that project maintainers can adopt to reduce forking-related inefficiencies in their communities.

ServDroid: detecting service usage inefficiencies in Android applications

Services in Android applications are frequently-used components for performing time-consuming operations in the background. While services play a crucial role in the app performance, our study shows that service uses in practice are not as efficient as expected, e.g., they tend to cause unnecessary resource occupation and/or energy consumption. Moreover, as service usage inefficiencies do not manifest with immediate failures, e.g., app crashes, existing testing-based approaches fall short in finding them. In this paper, we identify four anti-patterns of such service usage inefficiency bugs, including premature create, late destroy, premature destroy, and service leak, and present a static analysis technique, ServDroid, to automatically and effectively detect them based on the anti-patterns. We have applied ServDroid to a large collection of popular real-world Android apps. Our results show that, surprisingly, service usage inefficiencies are prevalent and can severely impact the app performance.

Together strong: cooperative Android app analysis

Recent years have seen the development of numerous tools for the analysis of taint flows in Android apps. Taint analyses aim at detecting data leaks, accidentally or by purpose programmed into apps. Often, such tools specialize in the treatment of specific features impeding precise taint analysis (like reflection or inter-app communication). This multitude of tools, their specific applicability and their various combination options complicate the selection of a tool (or multiple tools) when faced with an analysis instance, even for knowledgeable users, and hence hinders the successful adoption of taint analyses.

In this work, we thus present CoDiDroid, a framework for cooperative Android app analysis. CoDiDroid (1) allows users to ask questions about flows in apps in varying degrees of detail, (2) automatically generates subtasks for answering such questions, (3) distributes tasks onto analysis tools (currently DroidRA, FlowDroid, HornDroid, IC3 and two novel tools) and (4) at the end merges tool answers on subtasks into an overall answer. Thereby, users are freed from having to learn about the use and functionality of all these tools while still being able to leverage their capabilities. Moreover, we experimentally show that cooperation among tools pays off with respect to effectiveness, precision and scalability.

A framework for writing trigger-action todo comments in executable format

Natural language elements, e.g., todo comments, are frequently used to communicate among developers and to describe tasks that need to be performed (actions) when specific conditions hold on artifacts related to the code repository (triggers), e.g., from the Apache Struts project: “remove expectedJDK15 and if() after switching to Java 1.6”. As projects evolve, development processes change, and development teams reorganize, these comments, because of their informal nature, frequently become irrelevant or forgotten. We present the first framework, dubbed TrigIt, to specify trigger-action todo comments in executable format. Thus, actions are executed automatically when triggers evaluate to true. TrigIt specifications are written in the host language (e.g., Java) and are evaluated as part of the build process. The triggers are specified as query statements over abstract syntax trees, abstract representation of build configuration scripts, issue tracking systems, and system clock time. The actions are either notifications to developers or code transformation steps. We implemented TrigIt for the Java programming language and migrated 44 existing trigger-action comments from several popular open-source projects. Evaluation of TrigIt, via a user study, showed that users find TrigIt easy to learn and use. TrigIt has the potential to enforce more discipline in writing and maintaining comments in large code repositories.

Decomposing the rationale of code commits: the software developer’s perspective

Communicating the rationale behind decisions is essential for the success of software engineering projects. In particular, understanding the rationale of code commits is an important and often difficult task. We posit that part of such difficulty lies in rationale often being treated as a single piece of information. In this paper, we set to discover the breakdown of components in which developers decompose the rationale of code commits in the context of software maintenance, and to understand their experience with it and with its individual components. For this goal, we apply a mixed-methods approach, interviewing 20 software developers to ask them how they decompose rationale, and surveying an additional 24 developers to understand their experiences needing, finding, and recording those components. We found that developers decompose the rationale of code commits into 15 components, each of which is differently needed, found, and recorded. These components are: goal, need, benefits, constraints, alternatives, selected alternative, dependencies, committer, time, location, modifications, explanation of modifications, validation, maturity stage, and side effects. Our findings provide multiple implications. Educators can now disseminate the multiple dimensions and importance of the rationale of code commits. For practitioners, our decomposition of rationale defines a "common vocabulary" to use when discussing rationale of code commits, which we expect to strengthen the quality of their rationale sharing and documentation process. For researchers, our findings enable techniques for automatically assessing, improving, and generating rationale of code commits to specifically target the components that developers need.

Model-based testing of breaking changes in Node.js libraries

Semantic versioning is widely used by library developers to indicate whether updates contain changes that may break existing clients. Especially for dynamic languages like JavaScript, using semantic versioning correctly is known to be difficult, which often causes program failures and makes client developers reluctant to switch to new library versions.

The concept of type regression testing has recently been introduced as an automated mechanism to assist the JavaScript library developers. That mechanism is effective for detecting breaking changes in widely used libraries, but it suffers from scalability limitations that make it slow and also less useful for libraries that do not have many available clients.

This paper presents a model-based variant of type regression testing. Instead of comparing API models of a library before and after an update, it finds breaking changes by automatically generating tests from a reusable API model. Experiments show that this new approach significantly improves scalability: it runs faster, and it can find breaking changes in more libraries.

Monitoring-aware IDEs

Engineering modern large-scale software requires software developers to not solely focus on writing code, but also to continuously examine monitoring data to reason about the dynamic behavior of their systems. These additional monitoring responsibilities for developers have only emerged recently, in the light of DevOps culture. Interestingly, software development activities happen mainly in the IDE, while reasoning about production monitoring happens in separate monitoring tools. We propose an approach that integrates monitoring signals into the development environment and workflow. We conjecture that an IDE with such capability improves the performance of developers as time spent continuously context switching from development to monitoring would be eliminated. This paper takes a first step towards understanding the benefits of a possible monitoring-aware IDE. We implemented a prototype of a Monitoring-Aware IDE, connected to the monitoring systems of Adyen, a large-scale payment company that performs intense monitoring in their software systems. Given our results, we firmly believe that monitoring-aware IDEs can play an essential role in improving how developers perform monitoring.

Going big: a large-scale study on what big data developers ask

Software developers are increasingly required to write big data code. However, they find big data software development challenging. To help these developers it is necessary to understand big data topics that they are interested in and the difficulty of finding answers for questions in these topics. In this work, we conduct a large-scale study on Stackoverflow to understand the interest and difficulties of big data developers. To conduct the study, we develop a set of big data tags to extract big data posts from Stackoverflow; use topic modeling to group these posts into big data topics; group similar topics into categories to construct a topic hierarchy; analyze popularity and difficulty of topics and their correlations; and discuss implications of our findings for practice, research and education of big data software development and investigate their coincidence with the findings of previous work.

Why aren’t regular expressions a lingua franca? an empirical study on the re-use and portability of regular expressions

This paper explores the extent to which regular expressions (regexes) are portable across programming languages. Many languages offer similar regex syntaxes, and it would be natural to assume that regexes can be ported across language boundaries. But can regexes be copy/pasted across language boundaries while retaining their semantic and performance characteristics?

In our survey of 158 professional software developers, most indicated that they re-use regexes across language boundaries and about half reported that they believe regexes are a universal language.We experimentally evaluated the riskiness of this practice using a novel regex corpus — 537,806 regexes from 193,524 projects written in JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. Using our polyglot regex corpus, we explored the hitherto-unstudied regex portability problems: logic errors due to semantic differences, and security vulnerabilities due to performance differences.

We report that developers’ belief in a regex lingua franca is understandable but unfounded. Though most regexes compile across language boundaries, 15% exhibit semantic differences across languages and 10% exhibit performance differences across languages. We explained these differences using regex documentation, and further illuminate our findings by investigating regex engine implementations. Along the way we found bugs in the regex engines of JavaScript-V8, Python, Ruby, and Rust, and potential semantic and performance regex bugs in thousands of modules.

Nodest: feedback-driven static analysis of Node.js applications

Node.js provides the ability to write JavaScript programs for the server-side and has become a popular language for developing web applications. Node.js allows direct access to the underlying filesystem, operating system resources, and databases, but does not provide any security mechanism such as sandboxing of untrusted code, and injection vulnerabilities are now commonly reported in Node.js modules. Existing static dataflow analysis techniques do not scale to Node.js applications to find injection vulnerabilities because small Node.js web applications typically depend on many third-party modules. We present a new feedback-driven static analysis that scales well to detect injection vulnerabilities in Node.js applications. The key idea behind our new technique is that not all third-party modules need to be analyzed to detect an injection vulnerability. Results of running our analysis, Nodest, on real-world Node.js applications show that the technique scales to large applications and finds previously known as well as new vulnerabilities. In particular, Nodest finds 63 true positive taint flows in a set of our benchmarks, whereas a state-of-the-art static analysis reports 3 only. Moreover, our analysis scales to Express, the most popular Node.js web framework, and reports non-trivial injection vulnerabilities.

Effective error-specification inference via domain-knowledge expansion

Error-handling code responds to the occurrence of runtime errors. Failure to correctly handle errors can lead to security vulnerabilities and data loss. This paper deals with error handling in software written in C that uses the return-code idiom: the presence and type of error is encoded in the return value of a function. This paper describes EESI, a static analysis that infers the set of values that a function can return on error. Such a function error-specification can then be used to identify bugs related to incorrect error handling. The key insight of EESI is to bootstrap the analysis with domain knowledge related to error handling provided by a developer. EESI uses a combination of intraprocedural, flow-sensitive analysis and interprocedural, context-insensitive analysis to ensure precision and scalability. We built a tool ECC to demonstrate how the function error-specifications inferred by EESI can be used to automatically find bugs related to incorrect error handling. ECC detected 246 bugs across 9 programs, of which 110 have been confirmed. ECC detected 220 previously unknown bugs, of which 99 are confirmed. Two patches have already been merged into OpenSSL.

DeepStellar: model-based quantitative analysis of stateful deep learning systems

Deep Learning (DL) has achieved tremendous success in many cutting-edge applications. However, the state-of-the-art DL systems still suffer from quality issues. While some recent progress has been made on the analysis of feed-forward DL systems, little study has been done on the Recurrent Neural Network (RNN)-based stateful DL systems, which are widely used in audio, natural languages and video processing, etc. In this paper, we initiate the very first step towards the quantitative analysis of RNN-based DL systems. We model RNN as an abstract state transition system to characterize its internal behaviors. Based on the abstract model, we design two trace similarity metrics and five coverage criteria which enable the quantitative analysis of RNNs. We further propose two algorithms powered by the quantitative measures for adversarial sample detection and coverage-guided test generation. We evaluate DeepStellar on four RNN-based systems covering image classification and automated speech recognition. The results demonstrate that the abstract model is useful in capturing the internal behaviors of RNNs, and confirm that (1) the similarity metrics could effectively capture the differences between samples even with very small perturbations (achieving 97% accuracy for detecting adversarial samples) and (2) the coverage criteria are useful in revealing erroneous behaviors (generating three times more adversarial samples than random testing and hundreds times more than the unrolling approach).

REINAM: reinforcement learning for input-grammar inference

Program input grammars (i.e., grammars encoding the language of valid program inputs) facilitate a wide range of applications in software engineering such as symbolic execution and delta debugging. Grammars synthesized by existing approaches can cover only a small part of the valid input space mainly due to unanalyzable code (e.g., native code) in programs and lacking high-quality and high-variety seed inputs. To address these challenges, we present REINAM, a reinforcement-learning approach for synthesizing probabilistic context-free program input grammars without any seed inputs. REINAM uses an industrial symbolic execution engine to generate an initial set of inputs for the given target program, and then uses an iterative process of grammar generalization to proactively generate additional inputs to infer grammars generalized from these initial seed inputs. To efficiently search for target generalizations in a huge search space of candidate generalization operators, REINAM includes a novel formulation of the search problem as a reinforcement learning problem. Our evaluation on eleven real-world benchmarks shows that REINAM outperforms an existing state-of-the-art approach on precision and recall of synthesized grammars, and fuzz testing based on REINAM substantially increases the coverage of the space of valid inputs. REINAM is able to synthesize a grammar covering the entire valid input space for some benchmarks without decreasing the accuracy of the grammar.

Boosting operational DNN testing efficiency through conditioning

With the increasing adoption of Deep Neural Network (DNN) models as integral parts of software systems, efficient operational testing of DNNs is much in demand to ensure these models' actual performance in field conditions. A challenge is that the testing often needs to produce precise results with a very limited budget for labeling data collected in field.

Viewing software testing as a practice of reliability estimation through statistical sampling, we re-interpret the idea behind conventional structural coverages as conditioning for variance reduction. With this insight we propose an efficient DNN testing method based on the conditioning on the representation learned by the DNN model under testing. The representation is defined by the probability distribution of the output of neurons in the last hidden layer of the model. To sample from this high dimensional distribution in which the operational data are sparsely distributed, we design an algorithm leveraging cross entropy minimization.

Experiments with various DNN models and datasets were conducted to evaluate the general efficiency of the approach. The results show that, compared with simple random sampling, this approach requires only about a half of labeled inputs to achieve the same level of precision.

A comprehensive study on deep learning bug characteristics

Deep learning has gained substantial popularity in recent years. Developers mainly rely on libraries and tools to add deep learning capabilities to their software. What kinds of bugs are frequently found in such software? What are the root causes of such bugs? What impacts do such bugs have? Which stages of deep learning pipeline are more bug prone? Are there any antipatterns? Understanding such characteristics of bugs in deep learning software has the potential to foster the development of better deep learning platforms, debugging mechanisms, development practices, and encourage the development of analysis and verification frameworks. Therefore, we study 2716 high-quality posts from Stack Overflow and 500 bug fix commits from Github about five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand the types of bugs, root causes of bugs, impacts of bugs, bug-prone stage of deep learning pipeline as well as whether there are some common antipatterns found in this buggy software. The key findings of our study include: data bug and logic bug are the most severe bug types in deep learning software appearing more than 48% of the times, major root causes of these bugs are Incorrect Model Parameter (IPS) and Structural Inefficiency (SI) showing up more than 43% of the times.We have also found that the bugs in the usage of deep learning libraries have some common antipatterns.

Just fuzz it: solving floating-point constraints using coverage-guided fuzzing

We investigate the use of coverage-guided fuzzing as a means of proving satisfiability of SMT formulas over finite variable domains, with specific application to floating-point constraints. We show how an SMT formula can be encoded as a program containing a location that is reachable if and only if the program’s input corresponds to a satisfying assignment to the formula. A coverage-guided fuzzer can then be used to search for an input that reaches the location, yielding a satisfying assignment. We have implemented this idea in a tool, Just Fuzz-it Solver (JFS), and we present a large experimental evaluation showing that JFS is both competitive with and complementary to state-of-the-art SMT solvers with respect to solving floating-point constraints, and that the coverage-guided approach of JFS provides significant benefit over naive fuzzing in the floating-point domain. Applied in a portfolio manner, the JFS approach thus has the potential to complement traditional SMT solvers for program analysis tasks that involve reasoning about floating-point constraints.

Cerebro: context-aware adaptive fuzzing for effective vulnerability detection

Existing greybox fuzzers mainly utilize program coverage as the goal to guide the fuzzing process. To maximize their outputs, coverage-based greybox fuzzers need to evaluate the quality of seeds properly, which involves making two decisions: 1) which is the most promising seed to fuzz next (seed prioritization), and 2) how many efforts should be made to the current seed (power scheduling). In this paper, we present our fuzzer, Cerebro, to address the above challenges. For the seed prioritization problem, we propose an online multi-objective based algorithm to balance various metrics such as code complexity, coverage, execution time, etc. To address the power scheduling problem, we introduce the concept of input potential to measure the complexity of uncovered code and propose a cost-effective algorithm to update it dynamically. Unlike previous approaches where the fuzzer evaluates an input solely based on the execution traces that it has covered, Cerebro is able to foresee the benefits of fuzzing the input by adaptively evaluating its input potential. We perform a thorough evaluation for Cerebro on 8 different real-world programs. The experiments show that Cerebro can find more vulnerabilities and achieve better coverage than state-of-the-art fuzzers such as AFL and AFLFast.

iFixFlakies: a framework for automatically fixing order-dependent flaky tests

Regression testing provides important pass or fail signals that developers use to make decisions after code changes. However, flaky tests, which pass or fail even when the code has not changed, can mislead developers. A common kind of flaky tests are order-dependent tests, which pass or fail depending on the order in which the tests are run. Fixing order-dependent tests is often tedious and time-consuming.

We propose iFixFlakies, a framework for automatically fixing order-dependent tests. The key insight in iFixFlakies is that test suites often already have tests, which we call helpers, whose logic resets or sets the states for order-dependent tests to pass. iFixFlakies searches a test suite for helpers that make the order-dependent tests pass and then recommends patches for the order-dependent tests using code from these helpers. Our evaluation on 110 truly orderdependent tests from a public dataset shows that 58 of them have helpers, and iFixFlakies can fix all 58. We opened pull requests for 56 order-dependent tests (2 of 58 were already fixed), and developers have already accepted pull requests for 21 of them, with all the remaining ones still pending.

Binary reduction of dependency graphs

Delta debugging is a technique for reducing a failure-inducing input to a small input that reveals the cause of the failure. This has been successful for a wide variety of inputs including C programs, XML data, and thread schedules. However, for input that has many internal dependencies, delta debugging scales poorly. Such input includes C#, Java, and Java bytecode and they have presented a major challenge for input reduction until now. In this paper, we show that the core challenge is a reduction problem for dependency graphs, and we present a general strategy for reducing such graphs. We combine this with a novel algorithm for reduction called Binary Reduction in a tool called J-Reduce for Java bytecode. Our experiments show that our tool is 12x faster and achieves more reduction than delta debugging on average. This enabled us to create and submit short bug reports for three Java bytecode decompilers.

AggrePlay: efficient record and replay of multi-threaded programs

Deterministic replay presents challenges and often results in high memory and runtime overheads. Previous studies deterministically reproduce program outputs often only after several replay iterations or may produce a non-deterministic sequence of output to external sources. In this paper, we propose AggrePlay, a deterministic replay technique which is based on recording read-write interleavings leveraging thread-local determinism and summarized read values. During the record phase, AggrePlay records a read count vector clock for each thread on each memory location. Each thread checks the logged vector clock against the current read count in the replay phase before a write event. We present an experiment and analyze the results using the Splash2x benchmark suite as well as two real-world applications. The experimental results show that on average, AggrePlay experiences a better reduction in compressed log size, and 56% better runtime slowdown during the record phase, as well as a 41.58% higher probability in the replay phase than existing work.

The review linkage graph for code review analytics: a recovery approach and empirical study

Modern Code Review (MCR) is a pillar of contemporary quality assurance approaches, where developers discuss and improve code changes prior to integration. Since review interactions (e.g., comments, revisions) are archived, analytics approaches like reviewer recommendation and review outcome prediction have been proposed to support the MCR process. These approaches assume that reviews evolve and are adjudicated independently; yet in practice, reviews can be interdependent.

In this paper, we set out to better understand the impact of review linkage on code review analytics. To do so, we extract review linkage graphs where nodes represent reviews, while edges represent recovered links between reviews. Through a quantitative analysis of six software communities, we observe that (a) linked reviews occur regularly, with linked review rates of 25% in OpenStack, 17% in Chromium, and 3%–8% in Android, Qt, Eclipse, and Libreoffice; and (b) linkage has become more prevalent over time. Through qualitative analysis, we discover that links span 16 types that belong to five categories. To automate link category recovery, we train classifiers to label links according to the surrounding document content. Those classifiers achieve F1-scores of 0.71–0.79, at least doubling the F1-scores of a ZeroR baseline. Finally, we show that the F1-scores of reviewer recommenders can be improved by 37%–88% (5–14 percentage points) by incorporating information from linked reviews that is available at prediction time. Indeed, review linkage should be exploited by future code review analytics.

Mitigating power side channels during compilation

The code generation modules inside modern compilers, which use a limited number of CPU registers to store a large number of program variables, may introduce side-channel leaks even in software equipped with state-of-the-art countermeasures. We propose a program analysis and transformation based method to eliminate such leaks. Our method has a type-based technique for detecting leaks, which leverages Datalog-based declarative analysis and domain-specific optimizations to achieve high efficiency and accuracy. It also has a mitigation technique for the compiler's backend, more specifically the register allocation modules, to ensure that leaky intermediate computation results are stored in different CPU registers or memory locations. We have implemented and evaluated our method in LLVM for the x86 instruction set architecture. Our experiments on cryptographic software show that the method is effective in removing the side channel while being efficient, i.e., our mitigated code is more compact and runs faster than code mitigated using state-of-the-art techniques.

Maximal multi-layer specification synthesis

There has been a significant interest in applying programming-by-example to automate repetitive and tedious tasks. However, due to the incomplete nature of input-output examples, a synthesizer may generate programs that pass the examples but do not match the user intent. In this paper, we propose MARS, a novel synthesis framework that takes as input a multi-layer specification composed by input-output examples, textual description, and partial code snippets that capture the user intent. To accurately capture the user intent from the noisy and ambiguous description, we propose a hybrid model that combines the power of an LSTM-based sequence-to-sequence model with the apriori algorithm for mining association rules through unsupervised learning. We reduce the problem of solving a multi-layer specification synthesis to a Max-SMT problem, where hard constraints encode well-typed concrete programs and soft constraints encode the user intent learned by the hybrid model. We instantiate our hybrid model to the data wrangling domain and compare its performance against Morpheus, a state-of-the-art synthesizer for data wrangling tasks. Our experiments demonstrate that our approach outperforms MORPHEUS in terms of running time and solved benchmarks. For challenging benchmarks, our approach can suggest candidates with rankings that are an order of magnitude better than MORPHEUS which leads to running times that are 15x faster than MORPHEUS.

Phoenix: automated data-driven synthesis of repairs for static analysis violations

Traditional automatic program repair (APR) tools rely on a test-suite as a repair specification. But test suites even when available are not of specification quality, limiting the performance and hence viability of test-suite based repair. On the other hand, static analysis-based bug finding tools are seeing increasing adoption in industry but still face challenges since the reported violations are viewed as not easily actionable. We propose a novel solution that solves both these challenges through a technique for automatically generating high-quality patches for static analysis violations by learning from examples. Our approach uses the static analyzer as an oracle and does not require a test suite. We realize our solution in a system, Phoenix, that implements a fully-automated pipeline that mines and cleans patches for static analysis violations from the wild, learns generalized executable repair strategies as programs in a novel Domain Specific Language (DSL), and then instantiates concrete repairs from them on new unseen violations. Using Phoenix we mine a corpus of 5,389 unique violations and patches from 517 Github projects. In a cross-validation study on this corpus Phoenix successfully produced 4,596 bug-fixes, with a recall of 85% and a precision of 54%. When applied to the latest revisions of a further5 Github projects, Phoenix produced 94 correct patches to previously unknown bugs, 19 of which have already been accepted and merged by the development teams. To the best of our knowledge this constitutes, by far the largest application of any automatic patch generation technology to large-scale real-world systems

Black box fairness testing of machine learning models

Any given AI system cannot be accepted unless its trustworthiness is proven. An important characteristic of a trustworthy AI system is the absence of algorithmic bias. 'Individual discrimination' exists when a given individual different from another only in 'protected attributes' (e.g., age, gender, race, etc.) receives a different decision outcome from a given machine learning (ML) model as compared to the other individual. The current work addresses the problem of detecting the presence of individual discrimination in given ML models. Detection of individual discrimination is test-intensive in a black-box setting, which is not feasible for non-trivial systems. We propose a methodology for auto-generation of test inputs, for the task of detecting individual discrimination. Our approach combines two well-established techniques - symbolic execution and local explainability for effective test case generation. We empirically show that our approach to generate test cases is very effective as compared to the best-known benchmark systems that we examine.

Java reflection API: revealing the dark side of the mirror

Developers of widely used Java Virtual Machines (JVMs) implement and test the Java Reflection API based on a Javadoc, which is specified using a natural language. However, there is limited knowledge on whether Java Reflection API developers are able to systematically reveal i) underdetermined specifications; and ii) non-conformances between their implementation and the Javadoc. Moreover, current automatic test suite generators cannot be used to detect them. To better understand the problem, we analyze test suites of two widely used JVMs, and we conduct a survey with 130 developers who use the Java Reflection API to see whether the Javadoc impacts on their understanding. We also propose a technique to detect underdetermined specifications and non-conformances between the Javadoc and the implementations of the Java Reflection API. It automatically creates test cases, and executes them using different JVMs. Then, we manually execute some steps to identify underdetermined specifications and to confirm whether a non-conformance candidate is indeed a bug. We evaluate our technique in 439 input programs. Our technique identifies underdetermined specification and non-conformance candidates in 32 Java Reflection API public methods of 7 classes. We report underdetermined specification candidates in 12 Java Reflection API methods. Java Reflection API specifiers accept 3 underdetermined specification candidates (25%). We also report 24 non-conformance candidates to Eclipse OpenJ9 JVM, and 7 to Oracle JVM. Eclipse OpenJ9 JVM developers accept and fix 21 candidates (87.5%), and Oracle JVM developers accept 5 and fix 4 non-conformance candidates.

A conceptual replication of continuous integration pain points in the context of Travis CI

Continuous integration (CI) is an established software quality assurance practice, and the focus of much prior research with a diverse range of methods and populations. In this paper, we first conduct a literature review of 37 papers on CI pain points. We then conduct a conceptual replication study on results from these papers using a triangulation design consisting of a survey with 132 responses, 12 interviews, and two logistic regressions predicting Travis CI abandonment and switching on a dataset of 6,239 GitHub projects. We report and discuss which past results we were able to replicate, those for which we found conflicting evidence, those for which we did not find evidence, and the implications of these findings.

Ethnographic research in software engineering: a critical review and checklist

Software Engineering (SE) community has recently been investing significant amount of effort in qualitative research to study the human and social aspects of SE processes, practices, and technologies. Ethnography is one of the major qualitative research methods, which is based on constructivist paradigm that is different from the hypothetic-deductive research model usually used in SE. Hence, the adoption of ethnographic research method in SE can present significant challenges in terms of sufficient understanding of the methodological requirements and the logistics of its applications. It is important to systematically identify and understand various aspects of adopting ethnography in SE and provide effective guidance. We carried out an empirical inquiry by integrating a systematic literature review and a confirmatory survey. By reviewing the ethnographic studies reported in 111 identified papers and 26 doctoral theses and analyzing the authors' responses of 29 of those papers, we revealed several unique insights. These identified insights were then transformed into a preliminary checklist that helps improve the state-of-the-practice of using ethnography in SE. This study also identifies the areas where methodological improvements of ethnography are needed in SE.

Achilles’ heel of plug-and-Play software architectures: a grounded theory based approach

Through a set of well-defined interfaces, plug-and-play architectures enable additional functionalities to be added or removed from a system at its runtime. However, plug-ins can also increase the application’s attack surface or introduce untrusted behavior into the system. In this paper, we (1) use a grounded theory-based approach to conduct an empirical study of common vulnerabilities in plug-and-play architectures; (2) conduct a systematic literature survey and evaluate the extent that the results of the empirical study are novel or supported by the literature; (3) evaluate the practicality of the findings by interviewing practitioners with several years of experience in plug-and-play systems. By analyzing Chromium, Thunderbird, Firefox, Pidgin, WordPress, Apache OfBiz, and OpenMRS, we found a total of 303 vulnerabilities rooted in extensibility design decisions and observed that these plugin-related vulnerabilities were caused by 16 different types of vulnerabilities. Out of these 16 vulnerability types we identified 19 mitigation procedures for fixing them. The literature review supported 12 vulnerability types and 8 mitigation techniques discovered in our empirical study, and indicated that 5 mitigation techniques were not covered in our empirical study. Furthermore, it indicated that 4 vulnerability types and 11 mitigation techniques discovered in our empirical study were not covered in the literature. The interviews with practitioners confirmed the relevance of the findings and highlighted ways that the results of this empirical study can have an impact in practice.

Latent error prediction and fault localization for microservice applications by learning from system trace logs

In the production environment, a large part of microservice failures are related to the complex and dynamic interactions and runtime environments, such as those related to multiple instances, environmental configurations, and asynchronous interactions of microservices. Due to the complexity and dynamism of these failures, it is often hard to reproduce and diagnose them in testing environments. It is desirable yet still challenging that these failures can be detected and the faults can be located at runtime of the production environment to allow developers to resolve them efficiently. To address this challenge, in this paper, we propose MEPFL, an approach of latent error prediction and fault localization for microservice applications by learning from system trace logs. Based on a set of features defined on the system trace logs, MEPFL trains prediction models at both the trace level and the microservice level using the system trace logs collected from automatic executions of the target application and its faulty versions produced by fault injection. The prediction models thus can be used in the production environment to predict latent errors, faulty microservices, and fault types for trace instances captured at runtime. We implement MEPFL based on the infrastructure systems of container orchestrator and service mesh, and conduct a series of experimental studies with two opensource microservice applications (one of them being the largest open-source microservice application to our best knowledge). The results indicate that MEPFL can achieve high accuracy in intraapplication prediction of latent errors, faulty microservices, and fault types, and outperforms a state-of-the-art approach of failure diagnosis for distributed systems. The results also show that MEPFL can effectively predict latent errors caused by real-world fault cases.

The importance of accounting for real-world labelling when predicting software vulnerabilities

Previous work on vulnerability prediction assume that predictive models are trained with respect to perfect labelling information (includes labels from future, as yet undiscovered vulnerabilities). In this paper we present results from a comprehensive empirical study of 1,898 real-world vulnerabilities reported in 74 releases of three security-critical open source systems (Linux Kernel, OpenSSL and Wiresark). Our study investigates the effectiveness of three previously proposed vulnerability prediction approaches, in two settings: with and without the unrealistic labelling assumption. The results reveal that the unrealistic labelling assumption can profoundly mis- lead the scientific conclusions drawn; suggesting highly effective and deployable prediction results vanish when we fully account for realistically available labelling in the experimental methodology. More precisely, MCC mean values of predictive effectiveness drop from 0.77, 0.65 and 0.43 to 0.08, 0.22, 0.10 for Linux Kernel, OpenSSL and Wiresark, respectively. Similar results are also obtained for precision, recall and other assessments of predictive efficacy. The community therefore needs to upgrade experimental and empirical methodology for vulnerability prediction evaluation and development to ensure robust and actionable scientific findings.

Detecting concurrency memory corruption vulnerabilities

Memory corruption vulnerabilities can occur in multithreaded executions, known as concurrency vulnerabilities in this paper. Due to non-deterministic multithreaded executions, they are extremely difficult to detect. Recently, researchers tried to apply data race detectors to detect concurrency vulnerabilities. Unfortunately, these detectors are ineffective on detecting concurrency vulnerabilities. For example, most (90%) of data races are benign. However, concurrency vulnerabilities are harmful and can usually be exploited to launch attacks. Techniques based on maximal causal model rely on constraints solvers to predict scheduling; they can miss concurrency vulnerabilities in practice. Our insight is, a concurrency vulnerability is more related to the orders of events that can be reversed in different executions, no matter whether the corresponding accesses can form data races. We then define exchangeable events to identify pairs of events such that their execution orders can be probably reversed in different executions. We further propose algorithms to detect three major kinds of concurrency vulnerabilities. To overcome potential imprecision of exchangeable events, we also adopt a validation to isolate real vulnerabilities. We implemented our algorithms as a tool ConVul and applied it on 10 known concurrency vulnerabilities and the MySQL database server. Compared with three widely-used race detectors and one detector based on maximal causal model, ConVul was significantly more effective by detecting 9 of 10 known vulnerabilities and 6 zero-day vulnerabilities on MySQL (four have been confirmed). However, other detectors only detected at most 3 out of the 16 known and zero-day vulnerabilities.

Locating vulnerabilities in binaries via memory layout recovering

Locating vulnerabilities is an important task for security auditing, exploit writing, and code hardening. However, it is challenging to locate vulnerabilities in binary code, because most program semantics (e.g., boundaries of an array) is missing after compilation. Without program semantics, it is difficult to determine whether a memory access exceeds its valid boundaries in binary code. In this work, we propose an approach to locate vulnerabilities based on memory layout recovery. First, we collect a set of passed executions and one failed execution. Then, for passed and failed executions, we restore their program semantics by recovering fine-grained memory layouts based on the memory addressing model. With the memory layouts recovered in passed executions as reference, we can locate vulnerabilities in failed execution by memory layout identification and comparison. Our experiments show that the proposed approach is effective to locate vulnerabilities on 24 out of 25 DARPA’s CGC programs (96%), and can effectively classifies 453 program crashes (in 5 Linux programs) into 19 groups based on their root causes.

Storm: program reduction for testing and debugging probabilistic programming systems

Probabilistic programming languages offer an intuitive way to model uncertainty by representing complex probability models as simple probabilistic programs. Probabilistic programming systems (PP systems) hide the complexity of inference algorithms away from the program developer. Unfortunately, if a failure occurs during the run of a PP system, a developer typically has very little support in finding the part of the probabilistic program that causes the failure in the system.

This paper presents Storm, a novel general framework for reducing probabilistic programs. Given a probabilistic program (with associated data and inference arguments) that causes a failure in a PP system, Storm finds a smaller version of the program, data, and arguments that cause the same failure. Storm leverages both generic code and data transformations from compiler testing and domain-specific, probabilistic transformations. The paper presents new transformations that reduce the complexity of statements and expressions, reduce data size, and simplify inference arguments (e.g., the number of iterations of the inference algorithm).

We evaluated Storm on 47 programs that caused failures in two popular probabilistic programming systems, Stan and Pyro. Our experimental results show Storm’s effectiveness. For Stan, our minimized programs have 49% less code, 67% less data, and 96% fewer iterations. For Pyro, our minimized programs have 58% less code, 96% less data, and 99% fewer iterations. We also show the benefits of Storm when debugging probabilistic programs.

NullAway: practical type-based null safety for Java

NullPointerExceptions (NPEs) are a key source of crashes in modern Java programs. Previous work has shown how such errors can be prevented at compile time via code annotations and pluggable type checking. However, such systems have been difficult to deploy on large-scale software projects, due to significant build-time overhead and / or a high annotation burden. This paper presents NullAway, a new type-based null safety checker for Java that overcomes these issues. NullAway has been carefully engineered for low overhead, so it can run as part of every build. Further, NullAway reduces annotation burden through targeted unsound assumptions, aiming for no false negatives in practice on checked code. Our evaluation shows that NullAway has significantly lower build-time overhead (1.15×) than comparable tools (2.8-5.1×). Further, on a corpus of production crash data for widely-used Android apps built with NullAway, remaining NPEs were due to unchecked third-party libraries (64%), deliberate error suppressions (17%), or reflection and other forms of post-checking code modification (17%), never due to NullAway’s unsound assumptions for checked code.

Automatically detecting missing cleanup for ungraceful exits

Software encounters ungraceful exits due to either bugs in the interrupt/signal handler code or the intention of developers to debug the software. Users may suffer from ”weird” problems caused by leftovers of the ungraceful exits. A common practice to fix these problems is rebooting, which wipes away the stale state of the software. This solution, however, is heavyweight and often leads to poor user experience because it requires restarting other normal processes. In this paper, we design SafeExit, a tool that can automatically detect and pinpoint the root causes of the problems caused by ungraceful exits, which can help users fix the problems using lightweight solutions. Specifically, SafeExit checks the program exit behaviors in the case of an interrupted execution against its expected exit behaviors to detect the missing cleanup behaviors required for avoiding the ungraceful exit. The expected behaviors are obtained by monitoring the program exit under a normal execution. We apply SafeExit to 38 programs across 10 domains. SafeExit finds 133 types of cleanup behaviors from 36 programs and detects 2861 missing behaviors from 292 interrupted executions. To predict missing behaviors for unseen input scenarios, SafeExit trains prediction models using a set of sampled input scenarios. The results show that SafeExit is accurate with an average F-measure of 92.5%.

Finding and understanding bugs in software model checkers

Software Model Checking (SMC) is a well-known automatic program verification technique and frequently adopted for checking safety-critical software. Thus, the reliability of SMC tools themselves (i.e., software model checkers) is critical. However, little work exists on validating software model checkers, an important problem that this paper tackles by introducing a practical, automated fuzzing technique. For its simplicity and generality, we focus on control-flow reachability (e.g., whether or how many times a branch is reached) and address two specific challenges for effective fuzzing: oracle and scalability. Given a deterministic program, we (1) leverage its concrete executions to synthesize valid branch reachability properties (thus solving the oracle problem) and (2) fuse such individual properties into a single safety property (thus improving the scalability of fuzzing and reducing manual inspection). We have realized our approach as the MCFuzz tool and applied it to extensively test three state-of-the-art C software model checkers, CPAchecker, CBMC, and SeaHorn. MCFuzz has found 62 unique bugs in all three model checkers -- 58 have been confirmed, and 20 have been fixed. We have further analyzed and categorized these bugs (which are diverse), and summarized several lessons for building reliable and robust model checkers. Our testing effort has been well-appreciated by the model checker developers, and also led to improved tool usability and documentation.

A segmented memory model for symbolic execution

Symbolic execution is an effective technique for exploring paths in a program and reasoning about all possible values on those paths. However, the technique still struggles with code that uses complex heap data structures, in which a pointer is allowed to refer to more than one memory object. In such cases, symbolic execution typically forks execution into multiple states, one for each object to which the pointer could refer.

In this paper, we propose a technique that avoids this expensive forking by using a segmented memory model. In this model, memory is split into segments, so that each symbolic pointer refers to objects in a single segment. The size of the segments are bound by a threshold, in order to avoid expensive constraints. This results in a memory model where forking due to symbolic pointer dereferences is significantly reduced, often completely.

We evaluate our segmented memory model on a mix of whole program benchmarks (such as m4 and make) and library benchmarks (such as SQLite), and observe significant decreases in execution time and memory usage.

Releasing fast and slow: an exploratory case study at ING

The appeal of delivering new features faster has led many software projects to adopt rapid releases. However, it is not well understood what the effects of this practice are. This paper presents an exploratory case study of rapid releases at ING, a large banking company that develops software solutions in-house, to characterize rapid releases. Since 2011, ING has shifted to a rapid release model. This switch has resulted in a mixed environment of 611 teams releasing relatively fast and slow. We followed a mixed-methods approach in which we conducted a survey with 461 participants and corroborated their perceptions with 2 years of code quality data and 1 year of release delay data. Our research shows that: rapid releases are more commonly delayed than their non-rapid counterparts, however, rapid releases have shorter delays; rapid releases can be beneficial in terms of reviewing and user-perceived quality; rapidly released software tends to have a higher code churn, a higher test coverage and a lower average complexity; challenges in rapid releases are related to managing dependencies and certain code aspects, e.g., design debt.

SAR: learning cross-language API mappings with little knowledge

To save effort, developers often translate programs from one programming language to another, instead of implementing it from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, these approaches still require large amount of parallel corpora, ranging from pairs of APIs or code fragments that are functionally equivalent, to similar code comments.

To minimize the need of parallel corpora, this paper aims at an automated approach that can map APIs across languages with much less a priori knowledge than other approaches. The approach is based on an realization of the notion of domain adaption, combined with code embedding, to better align two vector spaces. Taking as input large sets of programs, our approach first generates numeric vector representations of the programs (including the APIs used in each language), and it adapts generative adversarial networks (GAN) to align the vectors in different spaces of two languages. For a better alignment, we initialize the GAN with parameters derived from API mapping seeds that can be identified accurately with a simple automatic signature-based matching heuristic. Then the cross language API mappings can be identified via nearest-neighbors queries in the aligned vector spaces. We have implemented the approach (SAR, named after three main technical components in the approach) in a prototype for mapping APIs across Java and C# programs. Our evaluation on about 2 million Java files and 1 million C# files shows that the approach can achieve 54% and 82% mapping accuracy in its top-1 and top-10 API mapping results with only 174 automatically identified seeds, more accurate than other approaches using the same or much more mapping seeds.

Robust log-based anomaly detection on unstable log data

Logs are widely used by large and complex software-intensive systems for troubleshooting. There have been a lot of studies on log-based anomaly detection. To detect the anomalies, the existing methods mainly construct a detection model using log event data extracted from historical logs. However, we find that the existing methods do not work well in practice. These methods have the close-world assumption, which assumes that the log data is stable over time and the set of distinct log events is known. However, our empirical study shows that in practice, log data often contains previously unseen log events or log sequences. The instability of log data comes from two sources: 1) the evolution of logging statements, and 2) the processing noise in log data. In this paper, we propose a new log-based anomaly detection approach, called LogRobust. LogRobust extracts semantic information of log events and represents them as semantic vectors. It then detects anomalies by utilizing an attention-based Bi-LSTM model, which has the ability to capture the contextual information in the log sequences and automatically learn the importance of different log events. In this way, LogRobust is able to identify and handle unstable log events and sequences. We have evaluated LogRobust using logs collected from the Hadoop system and an actual online service system of Microsoft. The experimental results show that the proposed approach can well address the problem of log instability and achieve accurate and robust results on real-world, ever-changing log data.

Pinpointing performance inefficiencies in Java

Many performance inefficiencies such as inappropriate choice of algorithms or data structures, developers' inattention to performance, and missed compiler optimizations show up as wasteful memory operations. Wasteful memory operations are those that produce/consume data to/from memory that may have been avoided. We present, JXPerf, a lightweight performance analysis tool for pinpointing wasteful memory operations in Java programs. Traditional byte code instrumentation for such analysis (1) introduces prohibitive overheads and (2) misses inefficiencies in machine code generation. JXPerf overcomes both of these problems. JXPerf uses hardware performance monitoring units to sample memory locations accessed by a program and uses hardware debug registers to monitor subsequent accesses to the same memory. The result is a lightweight measurement at the machine code level with attribution of inefficiencies to their provenance --- machine and source code within full calling contexts. JXPerf introduces only 7% runtime overhead and 7% memory overhead making it useful in production. Guided by JXPerf, we optimize several Java applications by improving code generation and choosing superior data structures and algorithms, which yield significant speedups.

Understanding flaky tests: the developer’s perspective

Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) despite exercising unchanged code. In this work, we examine the perceptions of software developers about the nature, relevance, and challenges of flaky tests.

We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature and the origin of the flakiness, as well as of the fixing effort. We also examined developers' fixing strategies. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: The flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report to face regard mostly the reproduction of the flaky behavior and the identification of the cause for the flakiness. Public preprint [], data and materials [].

SEntiMoji: an emoji-powered learning approach for sentiment analysis in software engineering

Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers' emotions in commit messages and identifying their opinions on Q&A forums. However, commonly used out-of-the-box sentiment analysis tools cannot obtain reliable results on SE tasks and the misunderstanding of technical jargon is demonstrated to be the main reason. Then, researchers have to utilize labeled SE-related texts to customize sentiment analysis for SE tasks via a variety of algorithms. However, the scarce labeled data can cover only very limited expressions and thus cannot guarantee the analysis quality. To address such a problem, we turn to the easily available emoji usage data for help. More specifically, we employ emotional emojis as noisy labels of sentiments and propose a representation learning approach that uses both Tweets and GitHub posts containing emojis to learn sentiment-aware representations for SE-related texts. These emoji-labeled posts can not only supply the technical jargon, but also incorporate more general sentiment patterns shared across domains. They as well as labeled data are used to learn the final sentiment classifier. Compared to the existing sentiment analysis methods used in SE, the proposed approach can achieve significant improvement on representative benchmark datasets. By further contrast experiments, we find that the Tweets make a key contribution to the power of our approach. This finding informs future research not to unilaterally pursue the domain-specific resource, but try to transform knowledge from the open domain through ubiquitous signals such as emojis.

SESSION: Industry Papers

FinExpert: domain-specific test generation for FinTech systems

To assure high quality of software systems, the comprehensiveness of the created test suite and efficiency of the adopted testing process are highly crucial, especially in the FinTech industry, due to a FinTech system’s complicated system logic, mission-critical nature, and large test suite. However, the state of the testing practice in the FinTech industry still heavily relies on manual efforts. Our recent research efforts contributed our previous approach as the first attempt to automate the testing process in China Foreign Exchange Trade System (CFETS) Information Technology Co. Ltd., a subsidiary of China’s Central Bank that provides China’s foreign exchange transactions, and revealed that automating test generation for such complex trading platform could help alleviate some of these manual efforts. In this paper, we investigate further the dilemmas faced in testing the CFETS trading platform, identify the importance of domain knowledge in its testing process, and propose a new approach of domain-specific test generation to further improve the effectiveness and efficiency of our previous approach in industrial settings. We also present findings of our empirical studies of conducting domain-specific testing on subsystems of the CFETS Trading Platform.

Design diagrams as ontological source

beginabstract In custom software development projects, it is frequently the case that the same type of software is being built for different customers. The deliverables are similar because they address the same market (e.g., Telecom, Banking) or have similar functions or both. However, most organisations do not take advantage of this similarity and conduct each project from scratch leading to lesser margins and lower quality. Our key observation is that the similarity among the projects alludes to the existence of a veritable domain of discourse whose ontology, if created, would make the similarity across the projects explicit. Design diagrams are an integral part of any commercial software project deliverables as they document crucial facets of the software solution. We propose an approach to extract ontological information from UML design diagrams (class and sequence diagrams) and represent it as domain ontology in a convenient representation. This ontology not only helps in developing a better understanding of the domain but also fosters software reuse for future software projects in that domain. Initial results on extracting ontology from thousands of model from public repository show that the created ontologies are accurate and help in better software reuse for new solutions. endabstract

Predicting pull request completion time: a case study on large scale cloud services

Effort estimation models have been long studied in software engineering research. Effort estimation models help organizations and individuals plan and track progress of their software projects and individual tasks to help plan delivery milestones better. Towards this end, there is a large body of work that has been done on effort estimation for projects but little work on an individual checkin (Pull Request) level. In this paper we present a methodology that provides effort estimates on individual developer check-ins which is displayed to developers to help them track their work items. Given the cloud development infrastructure pervasive in companies, it has enabled us to deploy our Pull Request Lifetime prediction system to several thousand developers across multiple software families. We observe from our deployment that the pull request lifetime prediction system conservatively helps save 44.61% of the developer time by accelerating Pull Requests to completion.

TERMINATOR: better automated UI test case prioritization

Automated UI testing is an important component of the continuous integration process of software development. A modern web-based UI is an amalgam of reports from dozens of microservices written by multiple teams. Queries on a page that opens up another will fail if any of that page's microservices fails. As a result, the overall cost for automated UI testing is high since the UI elements cannot be tested in isolation. For example, the entire automated UI testing suite at LexisNexis takes around 30 hours (3-5 hours on the cloud) to execute, which slows down the continuous integration process.

To mitigate this problem and give developers faster feedback on their code, test case prioritization techniques are used to reorder the automated UI test cases so that more failures can be detected earlier. Given that much of the automated UI testing is "black box" in nature, very little information (only the test case descriptions and testing results) can be utilized to prioritize these automated UI test cases. Hence, this paper evaluates 17 "black box" test case prioritization approaches that do not rely on source code information. Among these, we propose a novel TCP approach, that dynamically re-prioritizes the test cases when new failures are detected, by applying and adapting a state of the art framework from the total recall problem. Experimental results on LexisNexis automated UI testing data show that our new approach (which we call TERMINATOR), outperformed prior state of the art approaches in terms of failure detection rates with negligible CPU overhead.

Risks and assets: a qualitative study of a software ecosystem in the mining industry

Digitalization and servitization are impacting many domains, including the mining industry. As the equipment becomes connected and technical infrastructure evolves, business models and risk management need to adapt. In this paper, we present a study on how changes in asset and risk distribution are evolving for the actors in a software ecosystem (SECO) and system-of-systems (SoS) around a mining operation. We have performed a survey to understand how Service Level Agreements (SLAs) -- a common mechanism for managing risk -- are used in other domains. Furthermore, we have performed a focus group study with companies. There is an overall trend in the mining industry to move the investment cost (CAPEX) from the mining operator to the vendors. Hence, the mining operator instead leases the equipment (as operational expense, OPEX) or even acquires a service. This change in business model impacts operation, as knowledge is moved from the mining operator to the suppliers. Furthermore, as the infrastructure becomes more complex, this implies that the mining operator is more and more reliant on the suppliers for the operation and maintenance. As this change is still in an early stage, there is no formalized risk management, e.g. through SLAs, in place. Rather, at present, the companies in the ecosystem rely more on trust and the incentives created by the promise of mutual future benefits of innovation activities. We believe there is a need to better understand how to manage risk in SECO as it is established and evolves. At the same time, in a SECO, the focus is on cooperation and innovation, the companies do not have incentives to address this unless there is an incident. Therefore, industry need, we believe, help in systematically understanding risk and defining quality aspects such as reliability and performance in the new business environment.

Using microservices for non-intrusive customization of multi-tenant SaaS

Enterprise software vendors often need to support their customer companies to customize the enterprise software products deployed on-premises of customers. But when software vendors are migrating their products to cloud-based Software-as-a-Service (SaaS), deep customization that used to be done on-premises is not applicable to the cloud-based multi-tenant context in which all tenants share the same SaaS. Enabling tenant-specific customization in cloud-based multi-tenant SaaS requires a novel approach. This paper proposes a Microservices-based non-intrusive Customization framework for multi-tenant Cloud-based SaaS, called MiSC-Cloud. Non-intrusive deep customization means that the microservices for customization of each tenant are isolated from the main software product and other microservices for customization of other tenants. MiSC-Cloud makes deep customization possible via authorized API calls through API gateways to the APIs of the customization microservices and the APIs of the main software product. We have implemented a proof-of-concept of our approach to enable non-intrusive deep customization of an open-source cloud native reference application of Microsoft called eShopOnContainers. Based on this work, we provide some lessons learned and directions for future work.

Predicting breakdowns in cloud services (with SPIKE)

Maintaining web-services is a mission-critical task where any down- time means loss of revenue and reputation (of being a reliable service provider). In the current competitive web services market, such a loss of reputation causes extensive loss of future revenue.

To address this issue, we developed SPIKE, a data mining tool which can predict upcoming service breakdowns, half an hour into the future. Such predictions let an organization alert and assemble the tiger team to address the problem (e.g. by reconguring cloud hardware in order to reduce the likelihood of that breakdown).

SPIKE utilizes (a) regression tree learning (with CART); (b) synthetic minority over-sampling (to handle how rare spikes are in our data); (c) hyperparameter optimization (to learn best settings for our local data) and (d) a technique we called “topology sampling” where training vectors are built from extensive details of an individual node plus summary details on all their neighbors.

In the experiments reported here, SPIKE predicted service spikes 30 minutes into future with recalls and precision of 75% and above. Also, SPIKE performed relatively better than other widely-used learning methods (neural nets, random forests, logistic regression).

DeepDelta: learning to repair compilation errors

Programmers spend a substantial amount of time manually repairing code that does not compile. We observe that the repairs for any particular error class typically follow a pattern and are highly mechanical. We propose a novel approach that automatically learns these patterns with a deep neural network and suggests program repairs for the most costly classes of build-time compilation failures. We describe how we collect all build errors and the human-authored, in-progress code changes that cause those failing builds to transition to successful builds at Google. We generate an AST diff from the textual code changes and transform it into a domain-specific language called Delta that encodes the change that must be made to make the code compile. We then feed the compiler diagnostic information (as source) and the Delta changes that resolved the diagnostic (as target) into a Neural Machine Translation network for training. For the two most prevalent and costly classes of Java compilation errors, namely missing symbols and mismatched method signatures, our system called DeepDelta, generates the correct repair changes for 19,314 out of 38,788 (50%) of unseen compilation errors. The correct changes are in the top three suggested fixes 86% of the time on average.

WhoDo: automating reviewer suggestions at scale

Today's software development is distributed and involves continuous changes for new features and yet, their development cycle has to be fast and agile. An important component of enabling this agility is selecting the right reviewers for every code-change - the smallest unit of the development cycle. Modern tool-based code review is proven to be an effective way to achieve appropriate code review of software changes. However, the selection of reviewers in these code review systems is at best manual. As software and teams scale, this poses the challenge of selecting the right reviewers, which in turn determines software quality over time. While previous work has suggested automatic approaches to code reviewer recommendations, it has been limited to retrospective analysis. We not only deploy a reviewer suggestions algorithm - WhoDo - and evaluate its effect but also incorporate load balancing as part of it to address one of its major shortcomings: of recommending experienced developers very frequently. We evaluate the effect of this hybrid recommendation + load balancing system on five repositories within Microsoft. Our results are based around various aspects of a commit and how code review affects that. We attempt to quantitatively answer questions which are supposed to play a vital role in effective code review through our data and substantiate it through qualitative feedback of partner repositories.

An IR-based approach towards automated integration of geo-spatial datasets in map-based software systems

Data is arguably the most valuable asset of the modern world. In this era, the success of any data-intensive solution relies on the quality of data that drives it. Among vast amount of data that are captured, managed, and analyzed everyday, geospatial data are one of the most interesting class of data that hold geographical information of real-world phenomena and can be visualized as digital maps. Geo-spatial data is the source of many enterprise solutions that provide local information and insights. Companies often aggregate geospacial datasets from various sources in order to increase the quality of such solutions. However, a lack of a global standard model for geospatial datasets makes the task of merging and integrating datasets difficult and error prone. Traditionally, this aggregation was accomplished by domain experts manually validating the data integration process by merging new data sources and/or new versions of previous data against conflicts and other requirement violations. However, this manual approach is not scalable is a hinder toward rapid release when dealing with big datasets which change frequently. Thus more automated approaches with limited interaction with domain experts is required. As a first step to tackle this problem, we have leveraged Information Retrieval (IR) and geospatial search techniques to propose a systematic and automated conflict identification approach. To evaluate our approach, we conduct a case study in which we measure the accuracy of our approach in several real-world scenarios and followed by interviews with Localintel Inc. software developers to get their feedbacks.

Code coverage at Google

Code coverage is a measure of the degree to which a test suite exercises a software system. Although coverage is well established in software engineering research, deployment in industry is often inhibited by the perceived usefulness and the computational costs of analyzing coverage at scale. At Google, coverage information is computed for one billion lines of code daily, for seven programming languages. A key aspect of making coverage information actionable is to apply it at the level of changesets and code review. This paper describes Google’s code coverage infrastructure and how the computed code coverage information is visualized and used. It also describes the challenges and solutions for adopting code coverage at scale. To study how code coverage is adopted and perceived by developers, this paper analyzes adoption rates, error rates, and average code coverage ratios over a five-year period, and it reports on 512 responses, received from surveying 3000 developers. Finally, this paper provides concrete suggestions for how to implement and use code coverage in an industrial setting.

When deep learning met code search

There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet.

Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique.

Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective that more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.

FUDGE: fuzz driver generation at scale

At Google we have found tens of thousands of security and robustness bugs by fuzzing C and C++ libraries. To fuzz a library, a fuzzer requires a fuzz driver—which exercises some library code—to which it can pass inputs. Unfortunately, writing fuzz drivers remains a primarily manual exercise, a major hindrance to the widespread adoption of fuzzing. In this paper, we address this major hindrance by introducing the Fudge system for automated fuzz driver generation. Fudge automatically generates fuzz driver candidates for libraries based on existing client code. We have used Fudge to generate thousands of new drivers for a wide variety of libraries. Each generated driver includes a synthesized C/C++ program and a corresponding build script, and is automatically analyzed for quality. Developers have integrated over 200 of these generated drivers into continuous fuzzing services and have committed to address reported security bugs. Further, several of these fuzz drivers have been upstreamed to open source projects and integrated into the OSS-Fuzz fuzzing infrastructure. Running these fuzz drivers has resulted in over 150 bug fixes, including the elimination of numerous exploitable security vulnerabilities.

Industry practice of coverage-guided enterprise Linux kernel fuzzing

Coverage-guided kernel fuzzing is a widely-used technique that has helped kernel developers and testers discover numerous vulnerabilities. However, due to the high complexity of application and hardware environment, there is little study on deploying fuzzing to the enterprise-level Linux kernel. In this paper, collaborating with the enterprise developers, we present the industry practice to deploy kernel fuzzing on four different enterprise Linux distributions that are responsible for internal business and external services of the company. We have addressed the following outstanding challenges when deploying a popular kernel fuzzer, syzkaller, to these enterprise Linux distributions: coverage support absence, kernel configuration inconsistency, bugs in shallow paths, and continuous fuzzing complexity. This leads to a vulnerability detection of 41 reproducible bugs which are previous unknown in these enterprise Linux kernel and 6 bugs with CVE IDs in U.S. National Vulnerability Database, including flaws that cause general protection fault, deadlock, and use-after-free.

Architectural decision forces at work: experiences in an industrial consultancy setting

The concepts of decision forces and the decision forces viewpoint were proposed to help software architects to make architectural decisions more transparent and the documentation of their rationales more explicit. However, practical experience reports and guidelines on how to use the viewpoint in typical industrial project setups are not available. Existing works mainly focus on basic tool support for the documentation of the viewpoint or show how forces can be used as part of focused architecture review sessions. With this paper, we share experiences and lessons learned from applying the decision forces viewpoint in a distributed industrial project setup, which involves consultants supporting architects during the re-design process of an existing large software system. Alongside our findings, we describe new forces that can serve as template for similar projects, discuss challenges applying them in a distributed consultancy project, and share ideas for potential extensions.

The role of limitations and SLAs in the API industry

As software architecture design is evolving to a microservice paradigm, RESTful APIs are being established as the preferred choice to build applications. In such a scenario, there is a shift towards a growing market of APIs where providers offer different service levels with tailored limitations typically based on the cost.

In this context, while there are well established standards to describe the functional elements of APIs (such as the OpenAPI Specification), having a standard model for Service Level Agreements (SLAs) for APIs may boost an open ecosystem of tools that would represent an improvement for the industry by automating certain tasks during the development such as: SLA-aware scaffolding, SLA-aware testing, or SLA-aware requesters.

Unfortunately, despite there have been several proposals to describe SLAs for software in general and web services in particular during the past decades, there is an actual lack of a widely used standard due to the complex landscape of concepts surrounding the notion of SLAs and the multiple perspectives that can be addressed.

In this paper, we aim to analyze the landscape for SLAs for APIs in two different directions: i) Clarifying the SLA-driven API development lifecycle: its activities and participants; 2) Developing a catalog of relevant concepts and an ulterior prioritization based on different perspectives from both Industry and Academia.

As a main result, we present a scored list of concepts that paves the way to establish a concrete road-map for a standard industry-aligned specification to describe SLAs in APIs.

Evaluating model testing and model checking for finding requirements violations in Simulink models

Matlab/Simulink is a development and simulation language that is widely used by the Cyber-Physical System (CPS) industry to model dynamical systems. There are two mainstream approaches to verify CPS Simulink models: model testing that attempts to identify failures in models by executing them for a number of sampled test inputs, and model checking that attempts to exhaustively check the correctness of models against some given formal properties. In this paper, we present an industrial Simulink model benchmark, provide a categorization of different model types in the benchmark, describe the recurring logical patterns in the model requirements, and discuss the results of applying model checking and model testing approaches to identify requirements violations in the benchmarked models. Based on the results, we discuss the strengths and weaknesses of model testing and model checking. Our results further suggest that model checking and model testing are complementary and by combining them, we can significantly enhance the capabilities of each of these approaches individually. We conclude by providing guidelines as to how the two approaches can be best applied together.

Model checking a C++ software framework: a case study

This paper presents a case study on applying two model checkers, Spin and Divine, to verify key properties of a C++ software framework, known as ADAPRO, originally developed at CERN. Spin was used for verifying properties on the design level. Divine was used for verifying simple test applications that interacted with the implementation. Both model checkers were found to have their own respective sets of pros and cons, but the overall experience was positive. Because both model checkers were used in a complementary manner, they provided valuable new insights into the framework, which would arguably have been hard to gain by traditional testing and analysis tools only. Translating the C++ source code into the modeling language of the Spin model checker helped to find flaws in the original design. With Divine, defects were found in parts of the code base that had already been subject to hundreds of hours of unit tests, integration tests, and acceptance tests. Most importantly, model checking was found to be easy to integrate into the workflow of the software project and bring added value, not only as verification, but also validation methodology. Therefore, using model checking for developing library-level code seems realistic and worth the effort.

Evolving with patterns: a 31-month startup experience report

Software startups develop innovative products under extreme conditions of uncertainty. At the same time they represent a fast-growing sector in the economy and scale up research and technological advancement. This paper describes findings after observing a startup during its first 31 months of life. The data was collected through observations, unstructured interviews as well as from technical and managerial documentation of the startup. The findings are based on a deductive analysis and summarized in 24 contextualized patterns that concern communication, interaction with customer, teamwork, and management. Furthermore, 13 lessons learned are presented with the aim of sharing experience with other startups. This industry report contributes to understanding the applicability and usefulness of startups' patterns, providing valuable knowledge for the startup software engineering community.

Bridging the gap between ML solutions and their business requirements using feature interactions

Machine Learning (ML) based solutions are becoming increasingly popular and pervasive. When testing such solutions, there is a tendency to focus on improving the ML metrics such as the F1-score and accuracy at the expense of ensuring business value and correctness by covering business requirements. In this work, we adapt test planning methods of classical software to ML solutions. We use combinatorial modeling methodology to define the space of business requirements and map it to the ML solution data, and use the notion of data slices to identify the weaker areas of the ML solution and strengthen them. We apply our approach to three real-world case studies and demonstrate its value.

Design thinking in practice: understanding manifestations of design thinking in software engineering

This industry case study explores where and how Design Thinking supports software development teams in their endeavour to create innovative software solutions. Design Thinking has found its way into software companies ranging from startups to SMEs and multinationals. It is mostly seen as a human centered innovation approach or a way to elicit requirements in a more agile fashion. However, research in Design Thinking suggests that being exposed to DT changes the mindset of employees. Thus this article aims to explore the wider use of DT within software companies through a case study in a multinational organization. Our results indicate, that once trained in DT, employees find various ways to implement it not only as a pre-phase to software development but throughout their projects even applying it to aspects of their surroundings such as the development process, team spaces and team work. Specifically we present a model of how DT manifests itself in a software development company.

SESSION: Demonstrations

MOTSD: a multi-objective test selection tool using test suite diagnosability

Performing regression testing on large software systems becomes unfeasible as it takes too long to run all the test cases every time a change is made. The main motivation of this work was to provide a faster and earlier feedback loop to the developers at OutSystems when a change is made. The developed tool, MOTSD, implements a multi-objective test selection approach in a C# code base using a test suite diagnosability metric and historical metrics as objectives and it is powered by a particle swarm optimization algorithm. We present implementation challenges, current experimental results and limitations of the tool when applied in an industrial context. Screencast demo link: <a></a>

BIKER: a tool for Bi-information source based API method recommendation

Application Programming Interfaces (APIs) in software libraries play an important role in modern software development. Although most libraries provide API documentation as a reference, developers may find it difficult to directly search for appropriate APIs in documentation using the natural language description of the programming tasks. We call such phenomenon as knowledge gap, which refers to the fact that API documentation mainly describes API functionality and structure but lacks other types of information like concepts and purposes. In this paper, we propose a Java API recommendation tool named BIKER (Bi-Information source based KnowledgE Recommendation) to bridge the knowledge gap. We implement BIKER as a search engine website. Given a query in natural language, instead of directly searching API documentation, BIKER first searches for similar API-related questions on Stack Overflow to extract candidate APIs. Then, BIKER ranks them by considering the query’s similarity with both Stack Overflow posts and API documentation. Finally, to help developers better understand why each API is recommended and how to use them in practice, BIKER summarizes and presents supplementary information (e.g., API description, code examples in Stack Overflow posts) for each recommended API. Our quantitative evaluation and user study demonstrate that BIKER can help developers find appropriate APIs more efficiently and precisely.

Mart: a mutant generation tool for LLVM

Program mutation makes small syntactic alterations to programs' code in order to artificially create faulty programs (mutants). Mutants creation (generation) tools are often characterized by their mutation operators and the way they create and represent the mutants. This paper presents Mart, a mutants generation tool, for LLVM bitcode, that supports the fine-grained definition of mutation operators (as matching rule - replacing pattern pair; uses 816 defined pairs by default) and the restriction of the code parts to mutate. New operators are implemented in Mart by implementing their matching rules and replacing patterns. Mart also implements in-memory Trivial Compiler Equivalence to eliminate equivalent and duplicate mutants during mutants generation. Mart generates mutant code as separated mutant files, meta-mutants file, weak mutation and mutant coverage instrumented files. Mart is publicly available ( Mart has been applied to generate mutants for several research experiments and generated more than 4,000,000 mutants.

VARYS: an agnostic model-driven monitoring-as-a-service framework for the cloud

Cloud systems are large scalable distributed systems that must be carefully monitored to timely detect problems and anomalies. While a number of cloud monitoring frameworks are available, only a few solutions address the problem of adaptively and dynamically selecting the indicators that must be collected, based on the actual needs of the operator. Unfortunately, these solutions are either limited to infrastructure-level indicators or technology-specific, for instance, they are designed to work with OpenStack but not with other cloud platforms. This paper presents the VARYS monitoring framework, a technology-agnostic Monitoring-as-a-Service solution that can address KPI monitoring at all levels of the Cloud stack, including the application-level. Operators use VARYS to indicate their monitoring goals declaratively, letting the framework to perform all the operations necessary to achieve a requested monitoring configuration automatically. Interestingly, the VARYS architecture is general and extendable, and can thus be used to support increasingly more platforms and probing technologies.

JCOMIX: a search-based tool to detect XML injection vulnerabilities in web applications

Input sanitization and validation of user inputs are well-established protection mechanisms for microservice architectures against XML injection attacks (XMLi). The effectiveness of the protection mechanisms strongly depends on the quality of the sanitization and validation rule sets (e.g., regular expressions) and, therefore, security analysts have to test them thoroughly. In this demo, we introduce JCOMIX, a penetration testing tool that generates XMLi attacks (test cases) exposing XML vulnerabilities in front-end web applications. JCOMIX implements various search algorithms, including random search (traditional fuzzing), genetic algorithms (GAs), and the more recent co-operative, co-evolutionary algorithm designed explicitly for the XMLi testing (COMIX). We also show the results of an empirical study showing the effectiveness of JCOMIX in testing an open-source front-end web application.

Event trace reduction for effective bug replay of Android apps via differential GUI state analysis

Existing Android testing tools, such as Monkey, generate a large quantity and a wide variety of user events to expose latent GUI bugs in Android apps. However, even if a bug is found, a majority of the events thus generated are often redundant and bug-irrelevant. In addition, it is also time-consuming for developers to localize and replay the bug given a long and tedious event sequence (trace).

This paper presents ECHO, an event trace reduction tool for effective bug replay by using a new differential GUI state analysis. Given a sequence of events (trace), ECHO aims at removing bug-irrelevant events by exploiting the differential behavior between the GUI states collected when their corresponding events are triggered. During dynamic testing, ECHO injects at most one lightweight inspection event after every event to collect its corresponding GUI state. A new adaptive model is proposed to selectively inject inspection events based on sliding windows to differentiate the GUI states on-the-fly in a single testing process. The experimental results show that ECHO improves the effectiveness of bug replay by removing 85.11% redundant events on average while also revealing the same bugs as those detected when full event sequences are used.

PyGGI 2.0: language independent genetic improvement framework

PyGGI is a research tool for Genetic Improvement (GI), that is designed to be versatile and easy to use. We present version 2.0 of PyGGI, the main feature of which is an XML-based intermediate program representation. It allows users to easily define GI operators and algorithms that can be reused with multiple target languages. Using the new version of PyGGI, we present two case studies. First, we conduct an Automated Program Repair (APR) experiment with the QuixBugs benchmark, one that contains defective programs in both Python and Java. Second, we replicate an existing work on runtime improvement through program specialisation for the MiniSAT satisfiability solver. PyGGI 2.0 was able to generate a patch for a bug not previously fixed by any APR tool. It was also able to achieve 14% runtime improvement in the case of MiniSAT. The presented results show the applicability and the expressiveness of the new version of PyGGI. A video of the tool demo is at:

CloneCognition: machine learning based code clone validation tool

A code clone is a pair of similar code fragments, within or between software systems. To detect each possible clone pair from a software system while handling the complex code structures, the clone detection tools undergo a lot of generalization of the original source codes. The generalization often results in returning code fragments that are only coincidentally similar and not considered clones by users, and hence requires manual validation of the reported possible clones by users which is often both time-consuming and challenging. In this paper, we propose a machine learning based tool 'CloneCognition' (Open Source Codes: ; Video Demonstration: to automate the laborious manual validation process. The tool runs on top of any code clone detection tools to facilitate the clone validation process. The tool shows promising clone classification performance with an accuracy of up to 87.4%. The tool also exhibits significant improvement in the results when compared with state-of-the-art techniques for code clone validation.

EVMFuzzer: detect EVM vulnerabilities via fuzz testing

Ethereum Virtual Machine (EVM) is the run-time environment for smart contracts and its vulnerabilities may lead to serious problems to the Ethereum ecology. With lots of techniques being continuously developed for the validation of smart contracts, the testing of EVM remains challenging because of the special test input format and the absence of oracles. In this paper, we propose EVMFuzzer, the first tool that uses differential fuzzing technique to detect vulnerabilities of EVM. The core idea is to continuously generate seed contracts and feed them to the target EVM and the benchmark EVMs, so as to find as many inconsistencies among execution results as possible, eventually discover vulnerabilities with output cross-referencing. Given a target EVM and its APIs, EVMFuzzer generates seed contracts via a set of predefined mutators, and then employs dynamic priority scheduling algorithm to guide seed contracts selection and maximize the inconsistency. Finally, EVMFuzzer leverages benchmark EVMs as cross-referencing oracles to avoid manual checking. With EVMFuzzer, we have found several previously unknown security bugs in four widely used EVMs, and 5 of which had been included in Common Vulnerabilities and Exposures (CVE) IDs in U.S. National Vulnerability Database. The video is presented at

A dynamic taint analyzer for distributed systems

As in other software domains, information flow security is a fundamental aspect of code security in distributed systems. However, most existing solutions to information flow security are limited to centralized software. For distributed systems, such solutions face multiple challenges, including technique applicability, tool portability, and analysis scalability. To overcome these challenges, we present DistTaint, a dynamic information flow (taint) analyzer for distributed systems. By partial-ordering method-execution events, DistTaint infers implicit dependencies in distributed programs, so as to resolve the applicability challenge. It resolves the portability challenge by working fully at application level, without customizing the runtime platform. To achieve scalability, it reduces analysis costs using a multi-phase analysis, where the pre-analysis phase generates method-level results to narrow down the scope of the following statement-level analysis. We evaluated DistTaint against eight real-world distributed systems. Empirical results showed DistTaint’s applicability to, portability with, and scalability for industry-scale distributed systems, along with its capability of discovering known and unknown vulnerabilities. A demo video for DistTaint can be downloaded from or viewed here online. The tool package is here:

Governify for APIs: SLA-driven ecosystem for API governance

As software architecture design is evolving to a microservice paradigm, RESTful APIs are being established as the preferred choice to build applications. In such a scenario, there is a shift towards a growing market of APIs where providers offer different service levels with tailored limitations typically based on the cost.

In such a context, while there are well-established standards to describe the functional elements of APIs (such as the OpenAPI Specification), having a standard model for Service Level Agreements (SLAs) for APIs may boost an open ecosystem of tools that would represent an improvement for the industry by automating certain tasks during the development.

In this paper, we introduce Governify for APIs, an ecosystem of tools aimed to support the user during the SLA-Driven RESTful APIs’ development process. Namely, an SLA Editor, an SLA Engine and an SLA Instrumentation Library. We also present a fully operational SLA-Driven API Gateway built on the top of our ecosystem of tools. To evaluate our proposal, we used three sources for gathering validation feedback: industry, teaching and research. Website: <a></a> Video: <a></a>

Developing secure bitcoin contracts with BitML

We present a toolchain for developing and verifying smart contracts that can be executed on Bitcoin. The toolchain is based on BitML, a recent domain-specific language for smart contracts with a computationally sound embedding into Bitcoin. Our toolchain automatically verifies relevant properties of contracts, among which liquidity, ensuring that funds do not remain frozen within a contract forever. A compiler is provided to translate BitML contracts into sets of standard Bitcoin transactions: executing a contract corresponds to appending these transactions to the blockchain. We assess our toolchain through a benchmark of representative contracts.

DISCOVER: detecting algorithmic complexity vulnerabilities

Algorithmic Complexity Vulnerabilities (ACV) are a class of vulnerabilities that enable Denial of Service Attacks. ACVs stem from asymmetric consumption of resources due to complex loop termination logic, recursion, and/or resource intensive library APIs. Completely automated detection of ACVs is intractable and it calls for tools that assist human analysts.

We present DISCOVER, a suite of tools that facilitates human-on-the-loop detection of ACVs. DISCOVER's workflow can be broken into three phases - (1) Automated characterization of loops, (2) Selection of suspicious loops, and (3) Interactive audit of selected loops. We demonstrate DISCOVER using a case study using a DARPA challenge app. DISCOVER supports analysis of Java source code and Java bytecode. We demonstrate it for Java bytecode.

AnswerBot: an answer summary generation tool based on stack overflow

Software Q&A sites (like Stack Overflow) play an essential role in developers’ day-to-day work for problem-solving. Although search engines (like Google) are widely used to obtain a list of relevant posts for technical problems, we observed that the redundant relevant posts and sheer amount of information barriers developers to digest and identify the useful answers. In this paper, we propose a tool AnswerBot which enables to automatically generate an answer summary for a technical problem. AnswerBot consists of three main stages, (1) relevant question retrieval, (2) useful answer paragraph selection, (3) diverse answer summary generation. We implement it in the form of a search engine website. To evaluate AnswerBot, we first build a repository includes a large number of Java questions and their corresponding answers from Stack Overflow. Then, we conduct a user study that evaluates the answer summary generated by AnswerBot and two baselines (based on Google and Stack Overflow search engine) for 100 queries. The results show that the answer summaries generated by AnswerBot are more relevant, useful, and diverse. Moreover, we also substantially improved the efficiency of AnswerBot (from 309 to 8 seconds per query).

Eagle: a team practices audit framework for agile software development

Agile/XP (Extreme Programming) software teams are expected to follow a number of specific practices in each iteration, such as estimating the effort (”points”) required to complete user stories, properly using branches and pull requests to coordinate merging multiple contributors’ code, having frequent ”standups” to keep all team members in sync, and conducting retrospectives to identify areas of improvement for future iterations. We combine two observations in developing a methodology and tools to help teams monitor their performance on these practices. On the one hand, many Agile practices are increasingly supported by web-based tools whose ”data exhaust” can provide insight into how closely the teams are following the practices. On the other hand, some of the practices can be expressed in terms similar to those developed for expressing service level objectives (SLO) in software as a service; as an example, a typical SLO for an interactive Web site might be ”over any 5-minute window, 99% of requests to the main page must be delivered within 200ms” and, analogously, a potential Team Practice (TP) for an Agile/XP team might be ”over any 2-week iteration, 75% of stories should be ’1-point’ stories”. Following this similarity, we adapt a system originally developed for monitoring and visualizing service level agreement (SLA) compliance to monitor selected TPs for Agile/XP software teams. Specifically, the system consumes and analyzes the data exhaust from widely-used tools such as GitHub and Pivotal Tracker and provides team(s) and coach(es) a ”dashboard” summarizing the teams’ adherence to various practices. As a qualitative initial investigation of its usefulness, we deployed it to twenty student teams in a four-sprint software engineering project course. We find an improvement of the adherence to team practice and a positive students’ self-evaluations of their team practices when using the tool, compared to previous experiences using an Agile/XP methodology. The demo video is located at <a></a> and a landing page with a live demo at <a></a>.

SESSION: Doctoral Symposium

A taxonomy of metrics for software fault prediction

In the field of Software Fault Prediction (SFP), researchers exploit software metrics to build predictive models using machine learning and/or statistical techniques. SFP has existed for several decades and the number of metrics used has increased dramatically. Thus, the need for a taxonomy of metrics for SFP arises firstly to standardize the lexicon used in this field so that the communication among researchers is simplified and then to organize and systematically classify the used metrics. In this doctoral symposium paper, I present my ongoing work which aims not only to build such a taxonomy as comprehensive as possible, but also to provide a global understanding of the metrics for SFP in terms of detailed information: acronym(s), extended name, univocal description, granularity of the fault prediction (e.g., method and class), category, and research papers in which they were used.

Distributed execution of test cases and continuous integration

I present here a part of the research conducted in my Ph.D. course. In particular, I focus on my ongoing work on how to support testing in the context of Continuous Integration (CI) development by distributing the execution of test cases (TCs) on geographically dispersed servers. I show how to find a trade-off between the cost of leased servers and the time to execute a given test suite (TS). The distribution and the execution of TCs on servers is modeled as a multi-objective optimization problem, where the goal is to balance the cost to lease servers and the time to execute TCs. The preliminary results : (i) show evidence of the existence of a Pareto Front (trade-off between costs to lease servers and TCs time) and (ii) suggest that the found solutions are worthwhile as compared to a traditional non-distributed TS execution (i.e., a single server/PC). Although the obtained results cannot be considered conclusive, it seems that the solutions are worth to speed up the testing activities in the context of CI.

A longitudinal field study on creation and use of domain-specific languages in industry

Domain-specific languages (DSLs) have extensively been investigated in research and have frequently been applied in practice for over 20 years. While DSLs have been attributed improvements in terms of productivity, maintainability, and taming accidental complexity, surprisingly, we know little about their actual impact on the software engineering practice. This PhD project, that is done in close collaboration with our industrial partner Océ - A Canon Company, offers a unique opportunity to study the application of DSLs using a longitudinal field study. In particular, we focus on introducing DSLs with language workbenches, i.e., infrastructures for designing and deploying DSLs, for projects that are already running for several years and for which extensive domain analysis outcomes are available. In doing so, we expect to gain a novel perspective on DSLs in practice. Additionally, we aim to derive best practices for DSL development and to identify and overcome limitations in the current state-of-the-art tooling for DSLs.

Failure-driven program repair

Program repair techniques can dramatically reduce the cost of program debugging by automatically generating program fixes. Although program repair has been already successful with several classes of faults, it also turned out to be quite limited in the complexity of the fixes that can be generated.

This Ph.D. thesis addresses the problem of cost-effectively generating fixes of higher complexity by investigating how to exploit failure information to directly shape the repair process. In particular, this thesis proposes Failure-Driven Program Repair, which is a novel approach to program repair that exploits its knowledge about both the possible failures and the corresponding repair strategies, to produce highly specialized repair tasks that can effectively generate non-trivial fixes.

On extending single-variant model transformations for reuse in software product line engineering

Software product line engineering (SPLE) aims at increasing productivity by following the principles of variability and organized reuse. Combining the discipline with model-driven software engineering (MDSE) seeks to intensify this effect by raising the level of abstraction. Typically, a product line developed in a model-driven way is composed of various kinds of models, like class diagrams and database schemata. To automatically generate further necessary representations from a initial (source) model, model transformations may create a respective target model. In annotative approaches to SPLE, variability annotations, which are boolean expressions over the features of the product line, state in which products a (model) element is visible. State-of-the-art single-variant model transformations (SVMT), however, do not consider variability annotations additionally associated with model elements. Thus, multi-variant model transformations (MVMT) should bridge the gap between existing SPLE and MDSE approaches by reusing already existing technology to propagate annotations additionally to the the target. The present contribution gives an overview on the research we conduct to reuse SVMTs in model-driven SPLE and provides a plan on which steps are still to be taken.

Exploratory test agents for stateful software systems

The adequate testing of stateful software systems is a hard and costly activity. Failures that result from complex stateful interactions can be of high impact, and it can be hard to replicate failures resulting from erroneous stateful interactions. Addressing this problem in an automatic way would save cost and time and increase the quality of software systems in the industry. In this paper, we propose an approach that uses agents to explore software systems with the intention to find faults and gain knowledge.

Helping developers search and locate task-relevant information in natural language documents

While performing a task, software developers interact with a myriad of natural language documents. Not all information in these documents is relevant to a developer's task forcing them to filter relevant information from large amounts of irrelevant information. If a developer misses some of the necessary information for her task, she will have an incomplete or incorrect basis from which to complete the task. Many approaches mine relevant text fragments from natural language artifacts. However, existing approaches mine information for pre-defined tasks and from a restricted set of artifacts. I hypothesize that it is possible to design a more generalizable approach that can identify, for a particular task, relevant text across different artifact types establishing relationships between them and facilitating how developers search and locate task-relevant information. To investigate this hypothesis, I propose to match a developer's task to text fragments in natural language artifacts according to their semantics. By semantically matching textual pieces to a developer's task we aim to more precisely identify fragments relevant to a task. To help developers in thoroughly navigating through the identified fragments I also propose to synthesize and group them. Ultimately, this research aims to help developers make more informed decisions regarding their software development task. Dr. Gail C. Murphy supervises this work.

Improving requirements engineering practices to support experimentation in software startups

The importance of startups to economic development is indisputable. Software startups are startups that develop an innovative software-intensive product or service. In spite of the rising of several methodologies to improve their efficiency, most of software startups still fail. There are several possible reasons to failure including under or over-engineering the product because of not-suitable engineering practices, wasted resources, and missed market opportunities. The literature argues that experimentation is essential to innovation and entrepreneurship. Even though well-known startup development methodologies employ it, studies revealed that practitioners still do not use it. Given that requirements engineering is in between software engineering and business, in this study, I aim to improve these practices to foster experimentation in software startups. To achieve that, first I investigated how requirements engineering activities are performed in software startups. Then, my goal is to propose new requirements engineering practices to foster experimentation in this context.

Managing the open cathedral

Already early in the history of open source projects it became apparent that they are driven by only a few contributors, creating the biggest portion of code. Whereas this has already been shown in previous research, this work adds a time perspective and considers the dynamics and evolution of communities. These aspects become increasingly important with the growing involvement of firms in such communities. Open source software is today used in many commercial applications, but also gets actively developed by businesses. Therefore, understanding and managing such projects into a common direction gets of increasing interest. The author’s work is intended to build a better understanding of these communities, their dynamics over time, key players and dependencies on them.

Machine-learning supported vulnerability detection in source code

The awareness of writing secure code rises with the increasing number of attacks and their resultant damage. But often, software developers are no security experts and vulnerabilities arise unconsciously during the development process. They use static analysis tools for bug detection, which often come with a high false positive rate. The developers, therefore, need a lot of resources to mind about all alarms, if they want to consistently take care of the security of their software project. We want to investigate, if machine learning techniques could point the user to the position of a security weak point in the source code with a higher accuracy than ordinary methods with static analysis. For this purpose, we focus on current machine learning on code approaches for our initial studies to evolve an efficient way for finding security-related software bugs. We will create a configuration interface to discover certain vulnerabilities, categorized in CWEs. We want to create a benchmark tool to compare existing source code representations and machine learning architectures for vulnerability detection and develop a customizable feature model. At the end of this PhD project, we want to have an easy-to-use vulnerability detection tool based on machine learning on code.

SESSION: Student Research Competition

Software clusterings with vector semantics and the call graph

In this paper, we propose a novel method to determine a software's modules without knowledge of its architectural structure, and empirically validate the method's performance. We cluster files by combining document embeddings, generated with the Doc2Vec algorithm, and the call graph, provided by Static Graph Analyzers to an augmented graph. We use the Louvain Algorithm to determine its community structure and propose a module-level clustering. Our method performs better in terms of stability, authoritativeness, and extremity over other state-of-the-art clustering methods proposed in the literature and is able to decently recover the ground truth clustering of the Linux Kernel. Finally, we conclude that semantic information from vector semantics as well as the call graph can produce accurate results for software clusterings of large systems.

Machine learning-assisted performance testing

Automated testing activities like automated test case generation imply a reduction in human effort and cost, with the potential to impact the test coverage positively. If the optimal policy, i.e., the course of actions adopted, for performing the intended test activity could be learnt by the testing system, i.e., a smart tester agent, then the learnt policy could be reused in analogous situations which leads to even more efficiency in terms of required efforts. Performance testing under stress execution conditions, i.e., stress testing, which involves providing extreme test conditions to find the performance breaking points, remains a challenge, particularly for complex software systems. Some common approaches for generating stress test conditions are based on source code or system model analysis, or use-case based design approaches. However, source code or precise system models might not be easily available for testing. Moreover, drawing a precise performance model is often difficult, particularly for complex systems. In this research, I have used model-free reinforcement learning to build a self-adaptive autonomous stress testing framework which is able to learn the optimal policy for stress test case generation without having a model of the system under test. The conducted experimental analysis shows that the proposed smart framework is able to generate the stress test conditions for different software systems efficiently and adaptively without access to performance models.

File tracing by intercepting disk requests

Existing file operation tracing methods are always OS-specific. It is a problem for file monitoring in some exotic operating systems and other programs running in privileged mode. We present a file system specific solution that could be work with any guest OS on a virtual machine. This solution works on the basis of intercepting requests to a block device. Our implementation is based on the QEMU emulator and supports ext2 and ext3 file systems.

Recommending related functions from API usage-based function clone structures

Developers need to be able to find reusable code for desired software features in a way that supports opportunistic programming for increased developer productivity. Our objective is to develop a recommendation system that provides a developer with function recommendations having functionality relevant to her development task. We employ a combination of information retrieval, static code analysis and data mining techniques to build the proposed recommendation system called FACER (Feature-driven API usage-based Code Examples Recommender). We performed an experimental evaluation on 122 projects from GitHub from selected categories to determine the accuracy of the retrieved code for related features. FACER recommended functions with a precision of 54% and 75% when evaluated using automated and manual methods respectively.

Identifying the most valuable developers using artifact traceability graphs

Finding the most valuable and indispensable developers is a crucial task in software development. We categorize these valuable developers into two categories: connector and maven. A typical connector represents a developer who connects different groups of developers in a large-scale project. Mavens represent the developers who are the sole experts in specific modules of the project.

To identify the connectors and mavens, we propose an approach using graph centrality metrics and connections of traceability graphs. We conducted a preliminary study on this approach by using two open source projects: QT 3D Studio and Android. Initial results show that the approach leads to identify the essential developers.

Automated patch porting across forked projects

Forking projects provides a straightforward method for developers to reuse existing source code and tailor it to their own application scenarios, which can significantly reduce developers' burden. However, this process makes forked projects (upstream projects and their forks) share the same defects on reused code as well. With the independent development of forked projects, some defects can only be repaired in one of them, where the patches need to be ported to others as well. Manually tracking all such activities among them is hard. Previous studies reveal that porting patches across forked projects is imperative and call research in this direction. Targeting at this problem, we conducted an empirical study to analyze the characteristics of patches in forked projects. We found that 20.5% patches need to be ported among all analyzed patches, which is a non-negligible portion. Among all those patches that need to be ported, 73.2% can be easily ported by simple syntactic code transformations. However, it is still challenging for other 26.8% patches since the corresponding code has experienced different modifications in the forked projects. As a result, according to the insights from the study, we proposed a new approach, which aims to automatically identify and port patches across forked projects.

Employing different program analysis methods to study bug evolution

The evolution of software bugs has been a well-studied topic in software engineering. We used three different program analysis tools to examine the different versions of two popular sets of programming tools (gnu Binary and Core utilities), and check if their bugs increase or decrease over time. Each tool is based on a different approach, namely: static analysis, symbolic execution, and fuzzing. In this way we can observe potential differences on the kinds of bugs that each tool detects and examine their effectiveness. To do so, we have performed a qualitative analysis on the results. Overall, our results indicate that we cannot say if bugs either decrease or increase over time and that the tools identify different bug types based on the method they follow.

Reducing the workload of the Linux kernel maintainers: multiple-committer model

With the increasing scale and complexity of software, the traditional development workflow may be inapplicable, which is harmful to the sustainable development of projects. In this study, we explored a new workflow — multiple-committer model that was applied by a subsystem of the Linux kernel to confront the heavy workload of the maintainers. We designed four dimensions of metrics toevaluate the model effect and found that this model conspicuouslyreduces the workload of the maintainers. We also obtained thecrucial factors for implementing this model.

Efficient computing in a safe environment

Modern computer systems are facing security challenges and thus are forced to employ various encryption, mitigation mechanisms, and other measures that affect significantly their performance. In this study, we aim to identify the energy and run-time performance implications of Meltdown and Spectre mitigation mechanisms. To achieve our goal, we experiment on server platform using different test cases. Our results highlight that request handling and memory operations are noticeably affected from mitigation mechanisms, both in terms of energy and run-time performance.

The lessons software engineers can extract from painters to improve the software development process

The way software is produced is very similar to an artistic and creative process. This fact is well known and has been acknowledged from the very early era of computers.

Moreover, there are even further similarities between the software development process and painting. This similarity can consequently lead to the assumption that software engineers can utilise the knowledge the artists employ, since the people have created artistic works for centuries. This paper focuses on the following questions: what similarities exist between the software development process and painting, what artistic practices could be profitably transferred to software development, and, in particular, with reference to pair programming, how do artists paint together and is there something that could be learned by software developers and engineers.

The uniqueness of the proposed approach lies in the exploration of the novel idea with the use of a complex way to examine the topic, and considering developers primarily as creative people, not ordinary industrial workers.

An industrial application of test selection using test suite diagnosability

Performing full regression testing every time a change is made on large software systems tends to be unfeasible as it takes too long to run all the test cases. The main motivation of this work was to provide a shorter and earlier feedback loop to the developers at OutSystems when a change is made (instead of having to wait for slower feedback from a CI pipeline). The developed tool, MOTSD, implements a multi-objective test selection approach in a C# code base using a test suite diagnosability metric and historical metrics as objectives and it is powered by a particle swarm optimization algorithm. This paper presents implementation challenges, current experimental results and limitations of the developed approach when applied in an industrial context.

Understanding source code comments at large-scale

Source code comments are important for any software, but the basic patterns of writing comments across domains and programming languages remain unclear. In this paper, we take a first step toward understanding differences in commenting practices by analyzing the comment density of 150 projects in 5 different programming languages. We have found that there are noticeable differences in comment density, which may be related to the programming language used in the project and the purpose of the project.

A graph-based framework for analysing the design of smart contracts

Used as a platform for executing smart contracts, Blockchain technology has yielded new programming languages. We propose a graph-based framework for computing software design metrics for the Solidity programming language, and use this framework in a preliminary study on 505 smart contracts mined from GitHub. The results show that most of the smart contracts are rather straightforward from an objected-oriented point of view and that new design metrics specific to smart contracts should be developed.

Finding the shortest path to reproduce a failure found by TESTAR

TESTAR is a tool for automated testing via the GUI. It uses dynamic analysis during automated GUI exploration and generates the test sequences during the execution. TESTAR saves all kind of information about the tests in a Graph database that can be queried or traversed during or after the tests using a traversal language. Test sequences leading to a failure can be excessively long, making the root-cause analysis of the failure difficult. This paper proposes an initial approach to find the shortest path to reproduce an error found by TESTAR

Analysing socio-technical congruence in the package dependency network of Cargo

Software package distributions form large dependency networks maintained by large communities of contributors. My PhD research will consist of analysing the evolution of the socio-technical congruence of these package dependency networks, and studying its impact on the health of the ecosystem and its community. I have started a longitudinal empirical study of Cargo's dependency network and the social (commenting) and technical (development) activities in Cargo's package repositories on GitHub, and present some preliminary findings.

Tuning backfired? not (always) your fault: understanding and detecting configuration-related performance bugs

Performance bugs (PBugs) are often hard to detect due to their non fail-stop symptoms. Existing debugging techniques can only detect PBugs with known patterns (e.g. inefficient loops). The key reason behind this incapability is the lack of a general test oracle. Here, we argue that the configuration tuning can serve as a strong candidate for PBugs detection. First, prior work shows that most performance bugs are related to configurations. Second, the tuning reflects users’ expectation of performance changes. If the actual performance behaves differently from the users’ intuition, the related code segment is likely to be problematic.

In this paper, we first conduct a comprehensive study on configuration related performance bugs(CPBugs) from 7 representative softwares (i.e., MySQL, MariaDB, MongoDB, RocksDB, PostgreSQL, Apache and Nginx) and collect 135 real-world CPBugs. Next, by further analyzing the symptoms and root causes of the collected bugs, we identify 7 counter-intuitive patterns. Finally, by integrating the counter-intuitive patterns, we build a general test framework for detecting performance bugs.

On the use of lambda expressions in 760 open source Python projects

Lambdas as anonymous functions have gained significant prominence in programming languages such as Java, C++, Python and so on as developers tend to use them. With the dominant use of Python as backend language in many projects and large number of open source projects available, we set out to investigate the use of lambdas in Python projects and obtained 19 categories to classify lambda usages as preliminary results. Our study could help language designers to improve the state of the art libraries for lambda expressions and developers to use lambda expressions effectively.

Test-related factors and post-release defects: an empirical study

Testing is a very important activity whose purpose is to ensure software quality. Recent studies have studied the effects of test-related factors (e.g., code coverage) on software code quality, showing that they have good predictive power on post-release defects. Despite these studies demonstrated the existence of a relation between test-related factors and software code quality, they considered different factors separately. That led us to conduct an additional empirical study in which we considered these factors all together. The key findings of the study show that, while post-release defects are strongly related to process and code metrics of the production classes, test-related factors have a limited prediction impact.

Static deep neural network analysis for robustness

This work studies the static structure of deep neural network models using white box based approach and utilizes that knowledge to find the susceptible classes which can be misclassified easily. With the knowledge of susceptible classes, our work has proposed to retrain the model for those classes to achieve increased robustness. Our preliminary result has been evaluated on MNIST, F-MNIST, and CIFAR-10 (ImageNet and ResNet-32 model) based datasets and have been compared with two state-of-the-art detectors.

Are existing code smells relevant in web games? an empirical study

In software applications, code smells are considered as bad coding practices acquired at the time of development. The presence of such code smells in games may affect the process of game development adversely. Our preliminary study aims at investigating the existence of code smells in the games. To achieve this, we used JavaScript code smells detection tool JSNose against 361 JavaScript web games to find occurrences of JavaScript smells in games. Further, we conducted a manual study to find violations of known game programming patterns in 8 web games to verify the necessity of game-specific code smells detection tool. Our results shows that existing JavaScript code smells detection tool is not sufficient to find game-specific code smells in web games.

Tackling knowledge needs during software evolution

Developers use a large amount of their time to understand the system they work on, an activity referred to as program comprehension. Especially software evolution and forgetting over time lead to developers becoming unfamiliar with a system. To support them during program comprehension, we can employ knowledge recovery to reverse engineer implicit information from the system and the platform (e.g., GitHub) it is hosted on. However, to recover useful knowledge and to provide it in a useful way, we first need to understand what knowledge developers forget to what extent, what sources are reliable to recover knowledge, and how to trace knowledge to the features in a system. We tackle these three issues, aiming to provide empirical insights and tooling to support developers during software evolution and maintenance. The results help practitioners, as we support the analysis and understanding of systems, as well as researchers, showing opportunities to automate, for example, reverse-engineering techniques.

On the scalable dynamic taint analysis for distributed systems

To protect the privacy and search sensitive data leaks, we must solve multiple challenges (e.g., applicability, portability, and scalability) for developing an appropriate taint analysis for distributed systems.We hence present DistTaint, a dynamic taint analysis for distributed systems against these challenges. It could infer implicit dependencies from partial-ordering method events in executions to resolve the applicability challenge. DistTaint fully works at application-level without any customization of platforms to overcome the portability challenge. It exploits a multi-phase analysis to achieve scalability. By proposing a pre-analysis, DistTaint narrows down the following fine-grained analysis’ scope to reduce the overall cost significantly. Empirical results showed DistTaint’s practical applicability, portability, and scalability to industry-scale distributed programs, and its capability of discovering security vulnerabilities in real-world distributed systems. The tool package can be downloaded here:

Suggesting reviewers of software artifacts using traceability graphs

During the lifecycle of a software project, software artifacts constantly change. A change should be peer-reviewed to ensure the software quality. To maximize the benefit of review, the reviewer(s) should be chosen appropriately. However, choosing the right reviewer(s) might not be trivial especially in large projects. Researchers developed different methods to recommend reviewers. In this study, we introduce a novel approach for reviewer recommendation problem. Our approach utilizes the traceability graph of a software project and assigns a know-about score to each developer, then recommends the developers who have the maximum know-about score for an artifact. We tested our approach on an open source project and achieved top-3 recall of 0.85 with an MRR (mean reciprocal ranking) of 0.73.

Using software testing to repair models

Software testing is an important phase in the software development process, aiming at locating faults in artifacts, and achieve some confidence that the software behaves according to specification. There exists many software testing techniques applied to debugging, fault-localization, and repair of code, however, to the best of our knowledge, the application of software testing to locating faults in models and automatically repair them, is still an open issue. We present a project that investigates the use of software testing methods to automatically repair model artifacts, to support engineers in maintaining them consistent with the implementation and specification. We describe the research approach, the structure of the devised test-driven repair processes, present results in the cases of combinatorial models and feature models, and finally discuss future work of applying testing to repair models for other scenarios, such as timed automata.

Rethinking Regex engines to address ReDoS

Regular expressions (regexes) are a powerful string manipulation tool. Unfortunately, in programming languages like Python, Java, and JavaScript, they are unnecessarily dangerous, implemented with worst-case exponential matching behavior. This high time complexity exposes software services to regular expression denial of service (ReDoS) attacks.

We propose a data-driven redesign of regex engines, to reflect how regexes are used and what they typically look like. We report that about 95% of regexes in popular programming languages can be evaluated in linear time. The regex engine is a fundamental component of a programming language, and any changes risk introducing compatibility problems. We believe a full redesign is therefore impractical, and so we describe how the vast majority of regex matches can be made linear-time with minor, not major, changes to existing algorithms. Our prototype shows that on a kernel of the regex language, we can trade space for time to make regex matches safe

Context-aware test case adaptation

During software evolution, both production code and test cases evolve frequently. To assure software quality, test cases should evolve in time. However, test case evolution is usually delayed and error-prone. To facilitate this process, this paper proposes a context-aware test case adaptation approach (CAT), which first identifies and generalizes test case adaptation patterns, and then applies these patterns to automatically evolve test cases by analyzing their context. We conducted a preliminary study on three open-source projects and found that CAT correctly adapts 71.91% of test cases.

Empirical study of customer communication problem in agile requirements engineering

As Agile principles and values become an integral part of the soft-ware development culture, development processes experience significant changes. Requirements engineering, an individual phase occurring at the beginning of the traditional development, is distributed between various activities according to agile. However, how customer communication related problems are solved within the context of agile requirements engineering (RE)? Empirical study of that problem is done using 2 methods: systematic literature review and semi-structured interviews. Problems related to customer communication in agile RE are revealed and composed into patterns. Patterns are to be supplemented with the solutions in the further research.