ICSE '18- Proceedings of the 40th International Conference on Software Engineering

Full Citation in the ACM Digital Library

SESSION: Software repair I

Context-aware patch generation for better automated program repair

The effectiveness of search-based automated program repair is limited in the number of correct patches that can be successfully generated. There are two causes of such limitation. First, the search space does not contain the correct patch. Second, the search space is huge and therefore the correct patch cannot be generated (i.e., correct patches are either generated after incorrect plausible ones or not generated within the time budget).

To increase the likelihood of including the correct patches in the search space, we propose to work at a fine granularity in terms of AST nodes. This, however, will further enlarge the search space, increasing the challenge to find the correct patches. We address the challenge by devising a strategy to prioritize the candidate patches based on their likelihood of being correct. Specifically, we study the use of AST nodes' context information to estimate the likelihood.

In this paper, we propose CapGen, a context-aware patch generation technique. The novelty which allows CapGen to produce more correct patches lies in three aspects: (1) The fine-granularity design enables it to find more correct fixing ingredients; (2) The context-aware prioritization of mutation operators enables it to constrain the search space; (3) Three context-aware models enable it to rank correct patches at high positions before incorrect plausible ones. We evaluate CapGen on Defects4J and compare it with the state-of-the-art program repair techniques. Our evaluation shows that CapGen outperforms and complements existing techniques. CapGen achieves a high precision of 84.00% and can prioritize the correct patches before 98.78% of the incorrect plausible ones.

Towards practical program repair with on-demand candidate generation

Effective program repair techniques, which modify faulty programs to fix them with respect to given test suites, can substantially reduce the cost of manual debugging. A common repair approach is to iteratively first generate candidate programs with possible bug fixes and then validate them against the given tests until a candidate that passes all the tests is found. While this approach is conceptually simple, due to the potentially high number of candidates that need to first be generated and then be compiled and tested, existing repair techniques that embody this approach have relatively low effectiveness, especially for faults at a fine granularity.

To tackle this limitation, we introduce a novel repair technique, SketchFix, which generates candidate fixes on demand (as needed) during the test execution. Instead of iteratively re-compiling and re-executing each actual candidate program, SketchFix translates faulty programs to sketches, i.e., partial programs with "holes", and compiles each sketch once which may represent thousands of concrete candidates. With the insight that the space of candidates can be reduced substantially by utilizing the runtime behaviors of the tests, SketchFix lazily initializes the candidates of the sketches while validating them against the test execution.

We experimentally evaluate SketchFix on the Defects4J benchmark and the experimental results show that SketchFix works particularly well in repairing bugs with expression manipulation at the AST node-level granularity compared to other program repair techniques. Specifically, SketchFix correctly fixes 19 out of 357 defects in 23 minutes on average using the default setting. In addition, SketchFix finds the first repair with 1.6% of re-compilations (#compiled sketches/#candidates) and 3.0% of re-executions out of all repair candidates.

A correlation study between automated program repair and test-suite metrics

Automated program repair has attracted attention due to its potential to reduce debugging cost. Prior works show the feasibility of automated repair, and the research focus is gradually shifting towards the quality of generated patches. One promising direction is to control the quality of generated patches by controlling the quality of test-suites used. In this paper, 1we investigate the question: "Can traditional test-suite metrics used in software testing be used for automated program repair?". We empirically investigate the effectiveness of test-suite metrics (statement / branch coverage and mutation score) in controlling the reliability of repairs (the likelihood that repairs cause regressions). We conduct the largest-scale experiments to date with real-world software, and perform the first correlation study between test-suite metrics and the reliability of generated repairs. Our results show that by increasing test-suite metrics, the reliability of repairs tend to increase. Particularly, such trend is most strongly observed in statement coverage. This implies that traditional test-suite metrics used in software testing can also be used to improve the reliability of repairs in program repair.

Do automated program repair techniques repair hard and important bugs?

Automated program repair techniques use a buggy program and a partial specification (typically a test suite) to produce a program variant that satisfies the specification. While prior work has studied patch quality [10, 11] and maintainability [2], it has not examined whether automated repair techniques are capable of repairing defects that developers consider important or that are hard for developers to repair manually. This paper tackles those questions.

SESSION: Apps and app stores I

Software protection on the go: a large-scale empirical study on mobile app obfuscation

The prosperity of smartphone markets has raised new concerns about software security on mobile platforms, leading to a growing demand for effective software obfuscation techniques. Due to various differences between the mobile and desktop ecosystems, obfuscation faces both technical and non-technical challenges when applied to mobile software. Although there have been quite a few software security solution providers launching their mobile app obfuscation services, it is yet unclear how real-world mobile developers perform obfuscation as part of their software engineering practices.

Our research takes a first step to systematically studying the deployment of software obfuscation techniques in mobile software development. With the help of an automated but coarse-grained method, we computed the likelihood of an app being obfuscated for over a million app samples crawled from Apple App Store. We then inspected the top 6600 instances and managed to identify 601 obfuscated versions of 539 iOS apps. By analyzing this sample set with extensive manual effort, we made various observations that reveal the status quo of mobile obfuscation in the real world, providing insights into understanding and improving the situation of software protection on mobile platforms.

GUILeak: tracing privacy policy claims on user input data for Android applications

The Android mobile platform supports billions of devices across more than 190 countries around the world. This popularity coupled with user data collection by Android apps has made privacy protection a well-known challenge in the Android ecosystem. In practice, app producers provide privacy policies disclosing what information is collected and processed by the app. However, it is difficult to trace such claims to the corresponding app code to verify whether the implementation is consistent with the policy. Existing approaches for privacy policy alignment focus on information directly accessed through the Android platform (e.g., location and device ID), but are unable to handle user input, a major source of private information. In this paper, we propose a novel approach that automatically detects privacy leaks of user-entered data for a given Android app and determines whether such leakage may violate the app's privacy policy claims. For evaluation, we applied our approach to 120 popular apps from three privacy-relevant app categories: finance, health, and dating. The results show that our approach was able to detect 21 strong violations and 18 weak violations from the studied apps.

Online app review analysis for identifying emerging issues

Detecting emerging issues (e.g., new bugs) timely and precisely is crucial for developers to update their apps. App reviews provide an opportunity to proactively collect user complaints and promptly improve apps' user experience, in terms of bug fixing and feature refinement. However, the tremendous quantities of reviews and noise words (e.g., misspelled words) increase the difficulties in accurately identifying newly-appearing app issues. In this paper, we propose a novel and automated framework IDEA, which aims to IDentify Emerging App issues effectively based on online review analysis. We evaluate IDEA on six popular apps from Google Play and Apple's App Store, employing the official app changelogs as our ground truth. Experiment results demonstrate the effectiveness of IDEA in identifying emerging app issues. Feedback from engineers and product managers shows that 88.9% of them think that the identified issues can facilitate app development in practice. Moreover, we have successfully applied IDEA to several products of Tencent, which serve hundreds of millions of users.

EARMO: an energy-aware refactoring approach for mobile apps

With millions of smartphones sold every year, the development of mobile apps has grown substantially. The battery power limitation of mobile devices has push developers and researchers to search for methods to improve the energy efficiency of mobile apps. We propose a multiobjective refactoring approach to automatically improve the architecture of mobile apps, while controlling for energy efficiency. In this extended abstract we briefly summarize our work.

SESSION: Software evolution and maintenance I

Neuro-symbolic program corrector for introductory programming assignments

Automatic correction of programs is a challenging problem with numerous real world applications in security, verification, and education. One application that is becoming increasingly important is the correction of student submissions in online courses for providing feedback. Most existing program repair techniques analyze Abstract Syntax Trees (ASTs) of programs, which are unfortunately unavailable for programs with syntax errors. In this paper, we propose a novel Neuro-symbolic approach that combines neural networks with constraint-based reasoning. Specifically, our method first uses a Recurrent Neural Network (RNN) to perform syntax repairs for the buggy programs; subsequently, the resulting syntactically-fixed programs are repaired using constraint-based techniques to ensure functional correctness. The RNNs are trained using a corpus of syntactically correct submissions for a given programming assignment, and are then queried to fix syntax errors in an incorrect programming submission by replacing or inserting the predicted tokens at the error location. We evaluate our technique on a dataset comprising of over 14,500 student submissions with syntax errors. Our method is able to repair syntax errors in 60% (8689) of submissions, and finds functionally correct repairs for 23.8% (3455) submissions.

Automated localization for unreproducible builds

Reproducibility is the ability of recreating identical binaries under pre-defined build environments. Due to the need of quality assurance and the benefit of better detecting attacks against build environments, the practice of reproducible builds has gained popularity in many open-source software repositories such as Debian and Bitcoin. However, identifying the unreproducible issues remains a labour intensive and time consuming challenge, because of the lacking of information to guide the search and the diversity of the causes that may lead to the unreproducible binaries.

In this paper we propose an automated framework called RepLoc to localize the problematic files for unreproducible builds. RepLoc features a query augmentation component that utilizes the information extracted from the build logs, and a heuristic rule-based filtering component that narrows the search scope. By integrating the two components with a weighted file ranking module, RepLoc is able to automatically produce a ranked list of files that are helpful in locating the problematic files for the unreproducible builds. We have implemented a prototype and conducted extensive experiments over 671 real-world unreproducible Debian packages in four different categories. By considering the topmost ranked file only, RepLoc achieves an accuracy rate of 47.09%. If we expand our examination to the top ten ranked files in the list produced by RepLoc, the accuracy rate becomes 79.28%. Considering that there are hundreds of source code, scripts, Makefiles, etc., in a package, RepLoc significantly reduces the scope of localizing problematic files. Moreover, with the help of RepLoc, we successfully identified and fixed six new unreproducible packages from Debian and Guix.

Enlightened debugging

Numerous automated techniques have been proposed to reduce the cost of software debugging, a notoriously time-consuming and human-intensive activity. Among these techniques, Statistical Fault Localization (SFL) is particularly popular. One issue with SFL is that it is based on strong, often unrealistic assumptions on how developers behave when debugging. To address this problem, we propose Enlighten, an interactive, feedback-driven fault localization technique. Given a failing test, Enlighten (1) leverages SFL and dynamic dependence analysis to identify suspicious method invocations and corresponding data values, (2) presents the developer with a query about the most suspicious invocation expressed in terms of inputs and outputs, (3) encodes the developer feedback on the correctness of individual data values as extra program specifications, and (4) repeats these steps until the fault is found. We evaluated Enlighten in two ways. First, we applied Enlighten to 1,807 real and seeded faults in 3 open source programs using an automated oracle as a simulated user; for over 96% of these faults, Enlighten required less than 10 interactions with the simulated user to localize the fault, and a sensitivity analysis showed that the results were robust to erroneous responses. Second, we performed an actual user study on 4 faults with 24 participants and found that participants who used Enlighten performed significantly better than those not using our tool, in terms of both number of faults localized and time needed to localize the faults.

Experiences and challenges in building a data intensive system for data migration

Recent analyses[2, 4, 5] report that many sectors of our economy and society are more and more guided by data-driven decision processes (e.g., health care, public administrations, etc.). As such, Data Intensive (DI) applications are becoming more and more important and critical. They must be fault-tolerant, they should scale with the amount of data, and be able to elastically leverage additional resources as and when these last ones are provided [3]. Moreover, they should be able to avoid data drops introduced in case of sudden overloads and should offer some Quality of Service (QoS) guarantees.

SESSION: Human and social aspects of computing I

Sentiment analysis for software engineering: how far can we go?

Sentiment analysis has been applied to various software engineering (SE) tasks, such as evaluating app reviews or analyzing developers' emotions in commit messages. Studies indicate that sentiment analysis tools provide unreliable results when used out-of-the-box, since they are not designed to process SE datasets. The silver bullet for a successful application of sentiment analysis tools to SE datasets might be their customization to the specific usage context.

We describe our experience in building a software library recommender exploiting developers' opinions mined from Stack Overflow. To reach our goal, we retrained---on a set of 40k manually labeled sentences/words extracted from Stack Overflow---a state-of-the-art sentiment analysis tool exploiting deep learning. Despite such an effort- and time-consuming training process, the results were negative. We changed our focus and performed a thorough investigation of the accuracy of commonly used tools to identify the sentiment of SE related texts. Meanwhile, we also studied the impact of different datasets on tool performance. Our results should warn the research community about the strong limitations of current sentiment analysis tools.

Identifying features in forks

Fork-based development has been widely used both in open source communities and in industry, because it gives developers flexibility to modify their own fork without affecting others. Unfortunately, this mechanism has downsides: When the number of forks becomes large, it is difficult for developers to get or maintain an overview of activities in the forks. Current tools provide little help. We introduce Infox, an approach to automatically identify non-merged features in forks and to generate an overview of active forks in a project. The approach clusters cohesive code fragments using code and network-analysis techniques and uses information-retrieval techniques to label clusters with keywords. The clustering is effective, with 90 % accuracy on a set of known features. In addition, a human-subject evaluation shows that Infox can provide actionable insight for developers of forks.

Roles and impacts of hands-on software architects in five industrial case studies

Whether software architects should also code is an enduring question. In order to satisfy performance, security, reliability and other quality concerns, architects need to compare and carefully choose a combination of architectural patterns, styles or tactics. Then later in the development cycle, these architectural choices must be implemented completely and correctly so there will not be any drift from envisioned design. In this paper, we use data analytics-based techniques to study five large-scale software systems, examining the impact and the role of software architects who write code on software quality. Our quantitative study is augmented with a follow up interview of architects. This paper provides empirical evidence for supporting the pragmatic opinions that architects should write code. Our analysis shows that implementing architectural tactics is more complex than delivering functionality, tactics are more error prone than software functionalities, and the architects tend to introduce fewer bugs into the implementation of architectural tactics compared to the developers.

Sentiment polarity detection for software development

The role of sentiment analysis is increasingly emerging to study software developers' emotions by mining crowd-generated content within software repositories and information sources. With a few notable exceptions [1][5], empirical software engineering studies have exploited off-the-shelf sentiment analysis tools. However, such tools have been trained on non-technical domains and general-purpose social media, thus resulting in misclassifications of technical jargon and problem reports [2][4]. In particular, Jongeling et al. [2] show how the choice of the sentiment analysis tool may impact the conclusion validity of empirical studies because not only these tools do not agree with human annotation of developers' communication channels, but they also disagree among themselves.

Our goal is to move beyond the limitations of off-the-shelf sentiment analysis tools when applied in the software engineering domain. Accordingly, we present Senti4SD, a sentiment polarity classifier for software developers' communication channels. Senti4SD exploits a suite of lexicon-based, keyword-based, and semantic features for appropriately dealing with the domain-dependent use of a lexicon. We built a Distributional Semantic Model (DSM) to derive the semantic features exploited by Senti4SD. Specifically, we ran word2vec [3] on a collection of over 20 million documents from Stack Overflow, thus obtaining word vectors that are representative of developers' communication style. The classifier is trained and validated using a gold standard of 4,423 Stack Overflow posts, including questions, answers, and comments, which were manually annotated for sentiment polarity.

We release the full lab package2, which includes both the gold standard and the emotion annotation guidelines, to ease the execution of replications as well as new studies on emotion awareness in software engineering. To inform future research on word embedding for text categorization and information retrieval in software engineering, the replication kit also includes the DSM.

Results. The contribution of the lexicon-based, keyword-based, and semantic features is assessed by our empirical evaluation leveraging different feature settings. With respect to SentiStrength [6], a mainstream off-the-shelf tool that we use as a baseline, Senti4SD reduces the misclassifications of neutral and positive posts as emotionally negative. Furthermore, we provide empirical evidence of better performance also in presence of a minimal set of training documents.

SESSION: Software repair II

Semantic program repair using a reference implementation

Automated program repair has been studied via the use of techniques involving search, semantic analysis and artificial intelligence. Most of these techniques rely on tests as the correctness criteria, which causes the test overfitting problem. Although various approaches such as learning from code corpus have been proposed to address this problem, they are unable to guarantee that the generated patches generalize beyond the given tests. This work studies automated repair of errors using a reference implementation. The reference implementation is symbolically analyzed to automatically infer a specification of the intended behavior. This specification is then used to synthesize a patch that enforces conditional equivalence of the patched and the reference programs. The use of the reference implementation as an implicit correctness criterion alleviates overfitting in test-based repair. Besides, since we generate patches by semantic analysis, the reference program may have a substantially different implementation from the patched program, which distinguishes our approach from existing techniques for regression repair like Relifix. Our experiments in repairing the embedded Linux Busybox with GNU Coreutils as reference (and vice-versa) revealed that the proposed approach scales to real-world programs and enables the generation of more correct patches.

Automated repair of mobile friendly problems in web pages

Mobile devices have become a primary means of accessing the Internet. Unfortunately, many websites are not designed to be mobile friendly. This results in problems such as unreadable text, cluttered navigation, and content overflowing a device's viewport; all of which can lead to a frustrating and poor user experience. Existing techniques are limited in helping developers repair these mobile friendly problems. To address this limitation of prior work, we designed a novel automated approach for repairing mobile friendly problems in web pages. Our empirical evaluation showed that our approach was able to successfully resolve mobile friendly problems in 95% of the evaluation subjects. In a user study, participants preferred our repaired versions of the subjects and also considered the repaired pages to be more readable than the originals.

Static automated program repair for heap properties

Static analysis tools have demonstrated effectiveness at finding bugs in real world code. Such tools are increasingly widely adopted to improve software quality in practice. Automated Program Repair (APR) has the potential to further cut down on the cost of improving software quality. However, there is a disconnect between these effective bug-finding tools and APR. Recent advances in APR rely on test cases, making them inapplicable to newly discovered bugs or bugs difficult to test for deterministically (like memory leaks). Additionally, the quality of patches generated to satisfy a test suite is a key challenge. We address these challenges by adapting advances in practical static analysis and verification techniques to enable a new technique that finds and then accurately fixes real bugs without test cases. We present a new automated program repair technique using Separation Logic. At a high-level, our technique reasons over semantic effects of existing program fragments to fix faults related to general pointer safety properties: resource leaks, memory leaks, and null dereferences. The procedure automatically translates identified fragments into source-level patches, and verifies patch correctness with respect to reported faults. In this work we conduct the largest study of automatically fixing undiscovered bugs in real-world code to date. We demonstrate our approach by correctly fixing 55 bugs, including 11 previously undiscovered bugs, in 11 real-world projects.

Overfitting in semantics-based automated program repair

Existing APR techniques can be generally divided into two families: semantics- vs. heuristics-based. Semantics-based APR uses symbolic execution and test suites to extract semantic constraints, and uses program synthesis to synthesize repairs that satisfy the extracted constraints. Heuristic-based APR generates large populations of repair candidates via source manipulation, and searches for the best among them. Both families largely rely on a primary assumption that a program is correctly patched if the generated patch leads the program to pass all provided test cases. Patch correctness is thus an especially pressing concern. A repair technique may generate overfitting patches, which lead a program to pass all existing test cases, but fails to generalize beyond them. In this work, we revisit the overfitting problem with a focus on semantics-based APR techniques, complementing previous studies of the overfitting problem in heuristics-based APR. We perform our study using IntroClass and Codeflaws benchmarks, two datasets well-suited for assessing repair quality, to systematically characterize and understand the nature of overfitting in semantics-based APR. We find that similar to heuristics-based APR, overfitting also occurs in semantics-based APR in various different ways.

SESSION: Apps and app stores II

Studying the dialogue between users and developers of free apps in the google play store

The popularity of mobile apps continues to grow over the past few years. Mobile app stores, such as the Google Play Store and Apple's App Store provide a unique user feedback mechanism to app developers through app reviews. In the Google Play Store (and most recently in the Apple App Store), developers are able to respond to such user feedback.

Over the past years, mobile app reviews have been studied excessively by researchers. However, much of prior work (including our own prior work) incorrectly assumes that reviews are static in nature and that users never update their reviews. In a recent study, we started analyzing the dynamic nature of the review-response mechanism. Our previous study showed that responding to a review often has a positive effect on the rating that is given by the user to an app.

In this paper [1], we revisit our prior finding in more depth by studying 4.5 million reviews with 126,686 responses of 2,328 top free-to-download apps in the Google Play Store. One of the major findings of our paper is that the assumption that reviews are static is incorrect. In particular, we find that developers and users in some cases use this response mechanism as a rudimentary user support tool, where dialogues emerge between users and developers through updated reviews and responses. Even though the messages are often simple, we find instances of as many as ten user-developer back-and-forth messages that occur via the response mechanism.

Using a mixed-effect model, we identify that the likelihood of a developer responding to a review increases as the review rating gets lower or as the review content gets longer. In addition, we identify four patterns of developers: 1) developers who primarily respond to only negative reviews, 2) developers who primarily respond to negative reviews or to reviews based on their content, 3) developers who primarily respond to reviews which are posted shortly after the latest release of their app, and 4) developers who primarily respond to reviews which are posted long after the latest release of their app.

We perform a qualitative analysis of developer responses to understand what drives developers to respond to a review. We manually analyzed a statistically representative random sample of 347 reviews with responses of the top ten apps with the highest number of developer responses. We identify seven drivers that make a developer respond to a review, of which the most important ones are to thank the users for using the app and to ask the user for more details about the reported issue.

Our findings show that it can be worthwhile for app owners to respond to reviews, as responding may lead to an increase in the given rating. In addition, our findings show that studying the dialogue between users and developers provides valuable insights that can lead to improvements in the app store and the user support process.

The main contributions of this paper are as follows: (1) Our paper is the first work to demonstrate the dynamic nature of reviews. (2) Furthermore, we are the first to demonstrate a peculiar use of the app-review platforms as a user support medium. (3) In addition, our work is the first work to deeply explore developer responses in a systematic manner. (4) Finally, our classification of developer-responses highlights the value of providing canned or even automated responses in next generation app-review platforms.

Automated reporting of GUI design violations for mobile apps

The inception of a mobile app often takes form of a mock-up of the Graphical User Interface (GUI), represented as a static image delineating the proper layout and style of GUI widgets that satisfy requirements. Following this initial mock-up, the design artifacts are then handed off to developers whose goal is to accurately implement these GUIs and the desired functionality in code. Given the sizable abstraction gap between mock-ups and code, developers often introduce mistakes related to the GUI that can negatively impact an app's success in highly competitive marketplaces. Moreover, such mistakes are common in the evolutionary context of rapidly changing apps. This leads to the time-consuming and laborious task of design teams verifying that each screen of an app was implemented according to intended design specifications.

This paper introduces a novel, automated approach for verifying whether the GUI of a mobile app was implemented according to its intended design. Our approach resolves GUI-related information from both implemented apps and mock-ups and uses computer vision techniques to identify common errors in the implementations of mobile GUIs. We implemented this approach for Android in a tool called Gvt and carried out both a controlled empirical evaluation with open-source apps as well as an industrial evaluation with designers and developers from Huawei. The results show that Gvt solves an important, difficult, and highly practical problem with remarkable efficiency and accuracy and is both useful and scalable from the point of view of industrial designers and developers. The tool is currently used by over one-thousand industrial designers & developers at Huawei to improve the quality of their mobile apps.

Leveraging program analysis to reduce user-perceived latency in mobile applications

Reducing network latency in mobile applications is an effective way of improving the mobile user experience and has tangible economic benefits. This paper presents PALOMA, a novel client-centric technique for reducing the network latency by prefetching HTTP requests in Android apps. Our work leverages string analysis and callback control-flow analysis to automatically instrument apps using PALOMA's rigorous formulation of scenarios that address "what" and "when" to prefetch. PALOMA has been shown to incur significant runtime savings (several hundred milliseconds per prefetchable HTTP request), both when applied on a reusable evaluation benchmark we have developed and on real applications.

Repairing crashes in Android apps

Android apps are omnipresent, and frequently suffer from crashes --- leading to poor user experience and economic loss. Past work focused on automated test generation to detect crashes in Android apps. However, automated repair of crashes has not been studied. In this paper, we propose the first approach to automatically repair Android apps, specifically we propose a technique for fixing crashes in Android apps. Unlike most test-based repair approaches, we do not need a test-suite; instead a single failing test is meticulously analyzed for crash locations and reasons behind these crashes. Our approach hinges on a careful empirical study which seeks to establish common root-causes for crashes in Android apps, and then distills the remedy of these root-causes in the form of eight generic transformation operators. These operators are applied using a search-based repair framework embodied in our repair tool Droix. We also prepare a benchmark DroixBench capturing reproducible crashes in Android apps. Our evaluation of Droix on DroixBench reveals that the automatically produced patches are often syntactically identical to the human patch, and on some rare occasion even better than the human patch (in terms of avoiding regressions). These results confirm our intuition that our proposed transformations form a sufficient set of operators to patch crashes in Android.

SESSION: Regression testing

Hybrid regression test selection

Regression testing is crucial but can be extremely costly. Regression Test Selection (RTS) aims to reduce regression testing cost by only selecting and running the tests that may be affected by code changes. To date, various RTS techniques analyzing at different granularities (e.g., at the basic-block, method, and file levels) have been proposed. RTS techniques working on finer granularities may be more precise in selecting tests, while techniques working on coarser granularities may have lower overhead. According to a recent study, RTS at the file level (FRTS) can have less overall testing time compared with a finer grained technique at the method level, and represents state-of-the-art RTS. In this paper, we present the first hybrid RTS approach, HyRTS, that analyzes at multiple granularities to combine the strengths of traditional RTS techniques at different granularities. We implemented the basic HyRTS technique by combining the method and file granularity RTS. The experimental results on 2707 revisions of 32 projects, totalling over 124 Million LoC, demonstrate that HyRTS outperforms state-of-the-art FRTS significantly in terms of selected test ratio and the offline testing time. We also studied the impacts of each type of method-level changes, and further designed two new HyRTS variants based on the study results. Our additional experiments show that transforming instance method additions/deletions into file-level changes produces an even more effective HyRTS variant that can significantly outperform FRTS in both offline and online testing time.

Fine-grained test minimization

As a software system evolves, its test suite can accumulate redundancies over time. Test minimization aims at removing redundant test cases. However, current techniques remove whole test cases from the test suite using test adequacy criteria, such as code coverage. This has two limitations, namely (1) by removing a whole test case the corresponding test assertions are also lost, which can inhibit test suite effectiveness, (2) the issue of partly redundant test cases, i.e., tests with redundant test statements, is ignored. We propose a novel approach for fine-grained test case minimization. Our analysis is based on the inference of a test suite model that enables automated test reorganization within test cases. It enables removing redundancies at the test statement level, while preserving the coverage and test assertions of the test suite. We evaluated our approach, implemented in a tool called Testler, on the test suites of 15 open source projects. Our analysis shows that over 4,639 (24%) of the tests in these test suites are partly redundant, with over 11,819 redundant test statements in total. Our results show that Testler removes 43% of the redundant test statements, reducing the number of partly redundant tests by 52%. As a result, test suite execution time is reduced by up to 37% (20% on average), while maintaining the original statement coverage, branch coverage, test assertions, and fault detection capability.

FAST approaches to scalable similarity-based test case prioritization

Many test case prioritization criteria have been proposed for speeding up fault detection. Among them, similarity-based approaches give priority to the test cases that are the most dissimilar from those already selected. However, the proposed criteria do not scale up to handle the many thousands or even some millions test suite sizes of modern industrial systems and simple heuristics are used instead. We introduce the FAST family of test case prioritization techniques that radically changes this landscape by borrowing algorithms commonly exploited in the big data domain to find similar items. FAST techniques provide scalable similarity-based test case prioritization in both white-box and black-box fashion. The results from experimentation on real world C and Java subjects show that the fastest members of the family outperform other black-box approaches in efficiency with no significant impact on effectiveness, and also outperform white-box approaches, including greedy ones, if preparation time is not counted. A simulation study of scalability shows that one FAST technique can prioritize a million test cases in less than 20 minutes.

Towards refactoring-aware regression test selection

Regression testing checks that recent project changes do not break previously working functionality. Although important, regression testing is costly when changes are frequent. Regression test selection (RTS) optimizes regression testing by running only tests whose results might be affected by a change. Traditionally, RTS collects dependencies (e.g., on files) for each test and skips the tests, at a new project revision, whose dependencies did not change. Existing RTS techniques do not differentiate behavior-preserving transformations (i.e., refactorings) from other code changes. As a result, tests are run more frequently than necessary.

We present the first step towards a refactoring-aware RTS technique, dubbed Reks, which skips tests affected only by behavior-preserving changes. Reks defines rules to update the test dependencies without running the tests. To ensure that Reks does not hide any bug introduced by the refactoring engines, we integrate Reks only in the pre-submit testing phase, which happens on the developers' machines. We evaluate Reks by measuring the savings in the testing effort. Specifically, we reproduce 100 refactoring tasks performed by developers of 37 projects on GitHub. Our results show that Reks would not run, on average, 33% of available tests (that would be run by a refactoring-unaware RTS technique). Additionally, we systematically run 27 refactoring types on ten projects. The results, based on 74,160 refactoring tasks, show that Reks would not run, on average, 16% of tests (max: 97% and SD: 24%). Finally, our results show that the Reks update rules are efficient.

SESSION: Open-source systems

Inheritance usage patterns in open-source systems

This research investigates how object-oriented inheritance is actually used in practice. The aim is to close the gap between inheritance guidance and inheritance practice. It is based on detailed analyses of 2440 inheritance hierarchies drawn from 14 open-source systems. The original contributions made by this paper concern pragmatic assessment of inheritance hierarchy design quality. The findings show that inheritance is very widely used but that most of the usage patterns that occur in practice are simple in structure. They are so simple that they may not require much inheritance-specific design consideration. On the other hand, the majority of classes defined using inheritance actually appear within a relatively small number of large, complex hierarchies. While some of these large hierarchies appear to have a consistent structure, often based on a problem domain model or a design pattern, others do not. Another contribution is that the quality of hierarchies, especially the large problematic ones, may be assessed in practice based on size, shape, and the definition and invocation of novel methods - all properties that can be detected automatically.

Almost there: a study on quasi-contributors in open source software projects

Recent studies suggest that well-known OSS projects struggle to find the needed workforce to continue evolving---in part because external developers fail to overcome their first contribution barriers. In this paper, we investigate how and why quasi-contributors (external developers who did not succeed in getting their contributions accepted to an OSS project) fail. To achieve our goal, we collected data from 21 popular, non-trivial GitHub projects, identified quasi-contributors, and analyzed their pull-requests. In addition, we conducted surveys with quasi-contributors, and projects' integrators, to understand their perceptions about nonacceptance. We found 10,099 quasi-contributors --- about 70% of the total actual contributors --- that submitted 12,367 nonaccepted pull-requests. In five projects, we found more quasi-contributors than actual contributors. About one-third of the developers who took our survey disagreed with the nonacceptance, and around 30% declared the nonacceptance demotivated or prevented them from placing another pull-request. The main reasons for pull-request nonacceptance from the quasi-contributors' perspective were "superseded/duplicated pull-request" and "mismatch between developer's and team's vision/opinion." A manual analysis of a representative sample of 263 pull-requests corroborated with this finding. We also found reasons related to the relationship with the community and lack of experience or commitment from the quasi-contributors. This empirical study is particularly relevant to those interested in fostering developers' participation and retention in OSS communities.

Analyzing a decade of Linux system calls

The Linux kernel provides its services to the application layer using so-called system calls. All system calls combined form the Application Programming Interface (API) of the kernel. Hence, system calls provide us with a window into the development process and design decisions that are made for the Linux kernel. Our paper [1] presents the result of an empirical study of the changes (8,770) that were made to the system calls during the last decade (i.e., from April 2005 to December 2014). The main contributions and most important findings of our study are:

(1) An overview of the Linux system calls. As of December 2014, 396 system calls existed in the Linux kernel. They can be categorized into 10 groups (process management, signal processing, and so on). 76 of the system calls were added over the last decade (new system calls). A new system call is usually not activated for all architectures at the same time. 40 out of 76 (53%) new system calls and 102 of the 393 (26%) existing system calls were sibling calls. A sibling call is a system call that is similar in functionality, and often in name, to another system call.

(2) A study of the evolution of the Linux system calls over the last decade in terms of the size and type of changes that were made to the system calls. With an average growth of 25 LOC per day, the Linux system calls are relatively stable. The commits that are made to system calls are slightly more scattered than kernel commits. There exists a small group of very active system call developers. 8,288 of the 8,770 studied commits (95%) were made to maintain, improve and fix bugs in system calls. 36% of the system call-related commits were bug fixes. 4,498 (50%) of the commits were made to only 25 (6%) of the 393 system calls. 35% of the system call-related commits were made to conduct code restructuring, and 36% of the system call-related commits were made to fix bugs.

(3) A study of the type of bug fixes that were made to the system calls over the last decade. Developers make mistakes in the seemingly trivial activation process of a system call. The steps that are required to activate a system call, such as assigning the unique number and updating the system call table, are performed manually. 58% of the bug fix commits were made to fix semantic bugs. The proportion of bug fixes that fixed memory-related bugs remained constant throughout the last decade.

(4) An analysis of the results and a discussion of their implications.

Generalizability of results. We compared Linux system calls with FreeBSD system calls, to validate that our results are generalizable to other UNIX-based operating systems. Our findings for the FreeBSD operating system confirm that other UNIX-based operating systems use a system call mechanism that is similar to that of Linux. Therefore, we can safely assume that our findings are of value to other UNIX-based operating systems.

Suggestion for automation. First, we suggest the automation of simple, reoccurring tasks in Linux, such as adding and removing system calls. Our study on FreeBSD shows that such tasks can successfully be automated. Second, we suggest that historical information about the evolution of a kernel API should be used to guide the testing process. Finally, we suggest that existing automated testing tools are extended to support testing system calls.

Maintenance Effort. Compared to regular software systems, kernel APIs require an additional type of maintenance that involves adding and removing system calls. Also, approximately 11% of the maintenance effort of a kernel API is assigned to the infrastructure for providing the API.

Overall, the results of our study can be beneficial to practitioners, researchers, and more specifically kernel developers, by providing insights related to the challenges and problems that come with long term maintenance of a kernel API, such as the long-lived Linux kernel API. We have published our classification of 8,870 system call-related changes [1] so that it can be used to conduct further studies.

The full paper is accepted for publication in the Empirical Software Engineering journal, and can be found at: https://link.springer.com/article/10.1007/s10664-017-9551-z.

To distribute or not to distribute?: why licensing bugs matter

Software licenses dictate how source code or binaries can be modified, reused, and redistributed. In the case of open source projects, software licenses generally fit into two main categories, permissive and restrictive, depending on the degree to which they allow redistribution or modification under licenses different from the original one(s). Developers and organizations can also modify existing licenses, creating custom licenses with specific permissive/restrictive terms. Having such a variety of software licenses can create confusion among software developers, and can easily result in the introduction of licensing bugs, not necessarily limited to well-known license incompatibilities. In this work, we report a study aimed at characterizing licensing bugs by (i) building a catalog categorizing the types of licensing bugs developers and other stakeholders face, and (ii) understanding the implications licensing bugs have on the software projects they affect. The presented study is the result of the manual analysis of 1,200 discussions related to licensing bugs carried out in issue trackers and in five legal mailing lists of open source communities. Our findings uncover new types of licensing bugs not addressed in prior literature, and a detailed assessment of their implications.

SESSION: Test generation

Augusto: exploiting popular functionalities for the generation of semantic GUI tests with Oracles

Testing software applications by interacting with their graphical user interface (GUI) is an expensive and complex process. Current automatic test case generation techniques implement explorative approaches that, although producing useful test cases, have a limited capability of covering semantically relevant interactions, thus frequently missing important testing scenarios. These techniques typically interact with the available widgets following the structure of the GUI, without any guess about the functions that are executed.

In this paper we propose Augusto, a test case generation technique that exploits a built-in knowledge of the semantics associated with popular and well-known functionalities, such as CRUD operations, to automatically generate effective test cases with automated functional oracles. Empirical results indicate that Augusto can reveal faults that cannot be revealed with state of the art techniques.

Towards optimal concolic testing

Concolic testing integrates concrete execution (e.g., random testing) and symbolic execution for test case generation. It is shown to be more cost-effective than random testing or symbolic execution sometimes. A concolic testing strategy is a function which decides when to apply random testing or symbolic execution, and if it is the latter case, which program path to symbolically execute. Many heuristics-based strategies have been proposed. It is still an open problem what is the optimal concolic testing strategy. In this work, we make two contributions towards solving this problem. First, we show the optimal strategy can be defined based on the probability of program paths and the cost of constraint solving. The problem of identifying the optimal strategy is then reduced to a model checking problem of Markov Decision Processes with Costs. Secondly, in view of the complexity in identifying the optimal strategy, we design a greedy algorithm for approximating the optimal strategy. We conduct two sets of experiments. One is based on randomly generated models and the other is based on a set of C programs. The results show that existing heuristics have much room to improve and our greedy algorithm often outperforms existing heuristics.

DeepTest: automated testing of deep-neural-network-driven autonomous cars

Recent advances in Deep Neural Networks (DNNs) have led to the development of DNN-driven autonomous cars that, using sensors like camera, LiDAR, etc., can drive without any human intervention. Most major manufacturers including Tesla, GM, Ford, BMW, and Waymo/Google are working on building and testing different types of autonomous vehicles. The lawmakers of several US states including California, Texas, and New York have passed new legislation to fast-track the process of testing and deployment of autonomous vehicles on their roads.

However, despite their spectacular progress, DNNs, just like traditional software, often demonstrate incorrect or unexpected corner-case behaviors that can lead to potentially fatal collisions. Several such real-world accidents involving autonomous cars have already happened including one which resulted in a fatality. Most existing testing techniques for DNN-driven vehicles are heavily dependent on the manual collection of test data under different driving conditions which become prohibitively expensive as the number of test conditions increases.

In this paper, we design, implement, and evaluate DeepTest, a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles that can potentially lead to fatal crashes. First, our tool is designed to automatically generated test cases leveraging real-world changes in driving conditions like rain, fog, lighting conditions, etc. DeepTest systematically explore different parts of the DNN logic by generating test inputs that maximize the numbers of activated neurons. DeepTest found thousands of erroneous behaviors under different realistic driving conditions (e.g., blurring, rain, fog, etc.) many of which lead to potentially fatal crashes in three top performing DNNs in the Udacity self-driving car challenge.

Precise concolic unit testing of C programs using extended units and symbolic alarm filtering

Automated unit testing reduces manual effort to write unit test drivers/stubs and generate unit test inputs. However, automatically generated unit test drivers/stubs raise false alarms because they often over-approximate real contexts of a target function f and allow infeasible executions of f. To solve this problem, we have developed a concolic unit testing technique CONBRIO. To provide realistic context to f, it constructs an extended unit of f that consists of f and closely relevant functions to f. Also, CONBRIO filters out a false alarm by checking feasibility of a corresponding symbolic execution path with regard to f's symbolic calling contexts obtained by combining symbolic execution paths of f's closely related predecessor functions.

In the experiments on the crash bugs of 15 real-world C programs, CONBRIO shows both high bug detection ability (i.e. 91.0% of the target bugs detected) and high precision (i.e. a true to false alarm ratio is 1:4.5). Also, CONBRIO detects 14 new bugs in 9 target C programs studied in papers on crash bug detection techniques.

SESSION: Program reduction techniques

Spatio-temporal context reduction: a pointer-analysis-based static approach for detecting use-after-free vulnerabilities

Zero-day Use-After-Free (UAF) vulnerabilities are increasingly popular and highly dangerous, but few mitigations exist. We introduce a new pointer-analysis-based static analysis, CRed, for finding UAF bugs in multi-MLOC C source code efficiently and effectively. CRed achieves this by making three advances: (i) a spatio-temporal context reduction technique for scaling down soundly and precisely the exponential number of contexts that would otherwise be considered at a pair of free and use sites, (ii) a multi-stage analysis for filtering out false alarms efficiently, and (iii) a path-sensitive demand-driven approach for finding the points-to information required.

We have implemented CRed in LLVM-3.8.0 and compared it with four different state-of-the-art static tools: CBMC (model checking), Clang (abstract interpretation), Coccinelle (pattern matching), and Supa (pointer analysis) using all the C test cases in Juliet Test Suite (JTS) and 10 open-source C applications. For the ground-truth validated with JTS, CRed detects all the 138 known UAF bugs as CBMC and Supa do while Clang and Coccinelle miss some bugs, with no false alarms from any tool. For practicality validated with the 10 applications (totaling 3+ MLOC), CRed reports 132 warnings including 85 bugs in 7.6 hours while the existing tools are either unscalable by terminating within 3 days only for one application (CBMC) or impractical by finding virtually no bugs (Clang and Coccinelle) or issuing an excessive number of false alarms (Supa).

Program splicing

We introduce program splicing, a programming methodology that aims to automate the workflow of copying, pasting, and modifying code available online. Here, the programmer starts by writing a "draft" that mixes unfinished code, natural language comments, and correctness requirements. A program synthesizer that interacts with a large, searchable database of program snippets is used to automatically complete the draft into a program that meets the requirements. The synthesis process happens in two stages. First, the synthesizer identifies a small number of programs in the database that are relevant to the synthesis task. Next it uses an enumerative search to systematically fill the draft with expressions and statements from these relevant programs. The resulting program is returned to the programmer, who can modify it and possibly invoke additional rounds of synthesis.

We present an implementation of program splicing, called Splicer, for the Java programming language. Splicer uses a corpus of over 3.5 million procedures from an open-source software repository. Our evaluation uses the system in a suite of everyday programming tasks, and includes a comparison with a state-of-the-art competing approach as well as a user study. The results point to the broad scope and scalability of program splicing and indicate that the approach can significantly boost programmer productivity.

Chopped symbolic execution

Symbolic execution is a powerful program analysis technique that systematically explores multiple program paths. However, despite important technical advances, symbolic execution often struggles to reach deep parts of the code due to the well-known path explosion problem and constraint solving limitations.

In this paper, we propose chopped symbolic execution, a novel form of symbolic execution that allows users to specify uninteresting parts of the code to exclude during the analysis, thus only targeting the exploration to paths of importance. However, the excluded parts are not summarily ignored, as this may lead to both false positives and false negatives. Instead, they are executed lazily, when their effect may be observable by code under analysis. Chopped symbolic execution leverages various on-demand static analyses at runtime to automatically exclude code fragments while resolving their side effects, thus avoiding expensive manual annotations and imprecision.

Our preliminary results show that the approach can effectively improve the effectiveness of symbolic execution in several different scenarios, including failure reproduction and test suite augmentation.

Perses: syntax-guided program reduction

Given a program P that exhibits a certain property Ψ (e.g., a C program that crashes GCC when it is being compiled), the goal of program reduction is to minimize P to a smaller variant P′ that still exhibits the same property, i.e., Ψ(P′). Program reduction is important and widely demanded for testing and debugging. For example, all compiler/interpreter development projects need effective program reduction to minimize failure-inducing test programs to ease debugging. However, state-of-the-art program reduction techniques --- notably Delta Debugging (DD), Hierarchical Delta Debugging (HDD), and C-Reduce --- do not perform well in terms of speed (reduction time) and quality (size of reduced programs), or are highly customized for certain languages and thus lack generality.

This paper presents Perses, a novel framework for effective, efficient, and general program reduction. The key insight is to exploit, in a general manner, the formal syntax of the programs under reduction and ensure that each reduction step considers only smaller, syntactically valid variants to avoid futile efforts on syntactically invalid variants. Our framework supports not only deletion (as for DD and HDD), but also general, effective program transformations.

We have designed and implemented Perses, and evaluated it for two language settings: C and Java. Our evaluation results on 20 C programs triggering bugs in GCC and Clang demonstrate Perses's strong practicality compared to the state-of-the-art: (1) smaller size --- Perses's results are respectively 2% and 45% in size of those from DD and HDD; and (2) shorter reduction time --- Perses takes 23% and 47% time taken by DD and HDD respectively. Even when compared to the highly customized and optimized C-Reduce for C/C++, Perses takes only 38-60% reduction time.

SESSION: Security, privacy and trust I

Secure coding practices in Java: challenges and vulnerabilities

The Java platform and its third-party libraries provide useful features to facilitate secure coding. However, misusing them can cost developers time and effort, as well as introduce security vulnerabilities in software. We conducted an empirical study on StackOverflow posts, aiming to understand developers' concerns on Java secure coding, their programming obstacles, and insecure coding practices.

We observed a wide adoption of the authentication and authorization features provided by Spring Security---a third-party framework designed to secure enterprise applications. We found that programming challenges are usually related to APIs or libraries, including the complicated cross-language data handling of cryptography APIs, and the complex Java-based or XML-based approaches to configure Spring Security. In addition, we reported multiple security vulnerabilities in the suggested code of accepted answers on the StackOverfow forum. The vulnerabilities included disabling the default protection against Cross-Site Request Forgery (CSRF) attacks, breaking SSL/TLS security through bypassing certificate validation, and using insecure cryptographic hash functions. Our findings reveal the insufficiency of secure coding assistance and documentation, as well as the huge gap between security theory and coding practices.

EnMobile: entity-based characterization and analysis of mobile malware

Modern mobile malware tend to conduct their malicious exploits through sophisticated patterns of interactions that involve multiple entities, e.g., the mobile platform, human users, and network locations. Such malware often evade the detection by existing approaches due to their limited expressiveness and accuracy in characterizing and detecting these malware. To address these issues, in this paper, we recognize entities in the environment of an app, the app's interactions with such entities, and the provenance of these interactions, i.e., the intent and ownership of each interaction, as the key to comprehensively characterizing modern mobile apps, and mobile malware in particular. With this insight, we propose a novel approach named EnMobile including a new entity-based characterization of mobile-app behaviors, and corresponding static analyses, to accurately characterize an app's interactions with entities. We implement EnMobile and provide a practical application of EnMobile in a signature-based scheme for detecting mobile malware. We evaluate EnMobile on a set of 6614 apps consisting of malware from Genome and Drebin along with benign apps from Google Play. Our results show that EnMobile detects malware with substantially higher precision and recall than four state-of-the-art approaches, namely Apposcopy, Drebin, MUDFLOW, and AppContext.

Model comprehension for security risk assessment: an empirical comparison of tabular vs. graphical representations

Context: Tabular and graphical representations are used to communicate security risk assessments for IT systems. However, there is no consensus on which type of representation better supports the comprehension of risks (such as the relationships between threats, vulnerabilities and security controls). Vessey's cognitive fit theory predicts that graphs should be better because they capture spatial relationships. Method: We report the results of two studies performed in two countries with 69 and 83 participants respectively, in which we assessed the effectiveness of tabular and graphical representations concerning the extraction of correct information about security risks. Results: Participants who applied tabular risk models gave more precise and complete answers to the comprehension questions when requested to find simple and complex information about threats, vulnerabilities, or other elements of the risk models. Conclusions: Our findings can be explained by Vessey's cognitive fit theory as tabular models implicitly capture elementary linear spatial relationships. Interest for ICSE: It is almost taken for granted in Software Engineering that graphical-, diagram-based models are "the" way to go (e.g., the SE Body of Knowledge [3]). This paper provides some experimental-based doubts that this might not always be the case. It will provide an interesting debate that might ripple to traditional requirements and design notations outside security.

Privacy by designers: software developers' privacy mindset

Privacy by design (PbD) is a policy measure that calls for embedding privacy into the design of technologies at early stages of the development process and throughout its lifecycle. By introducing privacy considerations into the technological design, PbD delegates responsibility over privacy to those in charge of the design, namely software developers who design information technologies (hereafter called developers). Thus, for PbD to be a viable option, it is important to understand developers' perceptions, interpretation and practices as to informational privacy.

SESSION: Empirical software engineering

Does the propagation of artifact changes across tasks reflect work dependencies?

Developers commonly define tasks to help coordinate software development efforts---whether they be feature implementation, refactoring, or bug fixes. Developers establish links between tasks to express implicit dependencies that needs explicit handling---dependencies that often require the developers responsible for a given task to assess how changes in a linked task affect their own work and vice versa (i.e., change propagation). While seemingly useful, it is unknown if change propagation indeed coincides with task links.

No study has investigated to what extent change propagation actually occurs between task pairs and whether it is able to serve as a metric for characterizing the underlying task dependency. In this paper, we study the temporal relationship between developer reading and changing of source code in relationship to task links We identify seven situations that explain the varying correlation of change propagation with linked task pairs and find six motifs describing when change propagation occurs between non-linked task pairs. Our paper demonstrates that task links are indeed useful for recommending which artifacts to monitor for changes, which developers to involve in a task, or which tasks to inspect.

Large-scale analysis of framework-specific exceptions in Android apps

Mobile apps have become ubiquitous. For app developers, it is a key priority to ensure their apps' correctness and reliability. However, many apps still suffer from occasional to frequent crashes, weakening their competitive edge. Large-scale, deep analyses of the characteristics of real-world app crashes can provide useful insights to guide developers, or help improve testing and analysis tools. However, such studies do not exist --- this paper fills this gap. Over a four-month long effort, we have collected 16,245 unique exception traces from 2,486 open-source Android apps, and observed that framework-specific exceptions account for the majority of these crashes. We then extensively investigated the 8,243 framework-specific exceptions (which took six person-months): (1) identifying their characteristics (e.g., manifestation locations, common fault categories), (2) evaluating their manifestation via state-of-the-art bug detection techniques, and (3) reviewing their fixes. Besides the insights they provide, these findings motivate and enable follow-up research on mobile apps, such as bug detection, fault localization and patch generation. In addition, to demonstrate the utility of our findings, we have optimized Stoat, a dynamic testing tool, and implemented ExLocator, an exception localization tool, for Android apps. Stoat is able to quickly uncover three previously-unknown, confirmed/fixed crashes in Gmail and Google+; ExLocator is capable of precisely locating the root causes of identified exceptions in real-world apps. Our substantial dataset is made publicly available to share with and benefit the community.

Effect sizes and their variance for AB/BA crossover design studies

We addressed the issues related to repeated measures experimental design such as an AB/BA crossover design that have been neither discussed nor addressed in the software engineering literature.

Firstly, there are potentially two different standardized mean difference effect sizes that can be calculated, depending on whether the mean difference is standardized by the pooled within groups variance or the within-participants variance. Hence, we provided equations for non-standardized and standardized effect sizes and explained the need for two different types of standardized effect size, one for the repeated measures and one that would be equivalent to an independent groups design.

Secondly, as for any estimated parameters and also for the purposes of undertaking meta-analysis, it is necessary to calculate the variance of the standardized mean difference effect sizes (which is not the same as the variance of the study). Hence, we provided formulas for the small sample size effect size variance and the medium sample size approximation to the effect size variance, for both types of standardized effect size.

We also presented the model underlying the AB/BA crossover design and provided two examples (an empirical analysis of the real data set by Scanniello, as well as simulated data) to demonstrate how to construct the two standardized mean difference effect sizes and their variances, both from standard descriptive statistics and from the outputs provided by the linear mixed model package lme4 in R.

A conclusion is that crossover designs should be considered (instead of between groups design) only if:

• previous research has suggested that ρ is greater than zero and preferably greater than 0.25;

• there is either strong theoretical argument, or empirical evidence from a well-powered study, that the period by technique interaction is negligible.

Summarizing, our journal first paper [3]:

(1) Presents the formulas needed to calculate both non-standardized and standardized mean difference effect sizes for AB/BA crossover designs (see Section 4 and 5 of our paper [3]).

(2) Presents the formulas needed to estimate the variances of the non-standardized and standardized effect sizes which in the later cases need to be appropriate for the small to medium sample sizes commonly used in software engineering crossover designs (see Section 5 of our paper [3]).

(3) Explains how to calculate the effect sizes and their variances both from the descriptive statistics that should be reported and from the raw data (see Section 6 of our paper [3]).

It is worth mentioning that we based our formulas on our own corrections to the formulas presented earlier by Curtin et al. [1]. Our corrections for the variances of standardized weighted mean difference of an AB/BA cross-over trial were accepted by the author of the original formulas (Curtin), submitted jointly as a letter to Editor of Statistics in Medicine to assure the widespread (also beyond the software engineering domain) adoption of the corrected formulas, and accepted [2]. We proposed an alternative formulation of the standardized effect size for individual difference effects that is comparable with the standardized effect size commonly used for pretest/posttest studies. We also corrected the small sample size and moderate sample size variances reported by Curtin et al. for both the individual difference effect size and the standardized effect size comparable to independent groups trials, showing the derivation of the formulas from the variance of a t-variable. Using these results, researchers can now correctly calculate standardized effect size variances, allowing the calculation of confidence intervals for AB/BA cross-over trials, which in turn provides a direct link to null hypothesis testing and supports meta-analysis. Meta-analysts can now validly aggregate together results from independent groups, pretest/posttest and AB/BA cross-over trials. Last but not least, the presented contributions allow corrections of previously reported results.

A large-scale empirical study on the effects of code obfuscations on Android apps and anti-malware products

The Android platform has been the dominant mobile platform in recent years resulting in millions of apps and security threats against those apps. Anti-malware products aim to protect smartphone users from these threats, especially from malicious apps. However, malware authors use code obfuscation on their apps to evade detection by anti-malware products. To assess the effects of code obfuscation on Android apps and anti-malware products, we have conducted a large-scale empirical study that evaluates the effectiveness of the top anti-malware products against various obfuscation tools and strategies. To that end, we have obfuscated 3,000 benign apps and 3,000 malicious apps and generated 73,362 obfuscated apps using 29 obfuscation strategies from 7 open-source, academic, and commercial obfuscation tools. The findings of our study indicate that (1) code obfuscation significantly impacts Android anti-malware products; (2) the majority of anti-malware products are severely impacted by even trivial obfuscations; (3) in general, combined obfuscation strategies do not successfully evade anti-malware products more than individual strategies; (4) the detection of anti-malware products depend not only on the applied obfuscation strategy but also on the leveraged obfuscation tool; (5) anti-malware products are slow to adopt signatures of malicious apps; and (6) code obfuscation often results in changes to an app's semantic behaviors.

An empirical study on the interplay between semantic coupling and co-change of software classes

The evolution of software systems is an inevitable process which has to be managed effectively to enhance software quality. Change impact analysis (CIA) is a technique that identifies impact sets, i.e., the set of classes that require correction as a result of a change made to a class or artefact. These sets can also be considered as ripple effects and typically non-local: changes propagate to different parts of a system.

Two classes are considered logically coupled if they have co-changed in the past; past research has shown that the precision of CIA techniques increases if logical and semantic coupling (i.e., the extent to which the lexical content of two classes is related) are both considered. However, the relationship between semantic and logical coupling of software artefacts has not been extensively studied and no dependencies established between these two types of coupling. Are two often co-changed artefacts also strongly connected from a semantic point of view? Are two semantically similar artefacts bound to co-change in the future? Answering those questions would help increase the precision of CIA. It would also help software maintainers to focus on a smaller subset of artefacts more likely to co-evolve in the future.

This study investigated the relationship between semantic and logical coupling. Using Chi-squared statistical tests, we identified similarities in semantic coupling using class corpora and class identifiers. We then computed Spearman's rank correlation between semantic and logical coupling metrics for class pairs to detect whether semantic and logical relationships co-varied in OO software. Finally, we investigated the overlap between semantic and logical relationships by identifying the proportion of classes linked through both coupling types. Our empirical study and results were based on seventy-nine open-source software projects. Results showed that: (a) measuring the semantic similarity of classes by using their identifiers is computationally efficient; (b) using identifier-based coupling can be used interchangeably with semantic similarity based on their corpora, albeit not always; (c) no correlation between the strengths of semantic and change coupling was found. Finally, (d) a directional relationship between the two was identified; 70% of semantic dependencies are linked through change coupling but not vice versa.

Based on our findings, we conclude that identifying more efficient methods of semantic coupling computation as well as a directional relationship between semantic and change dependencies could help to improve CIA methods that integrate semantic coupling information. This may also help to reveal implicit dependencies not captured by static source code analysis.

SESSION: Test improvement

DeFlaker: automatically detecting flaky tests

Developers often run tests to check that their latest changes to a code repository did not break any previously working functionality. Ideally, any new test failures would indicate regressions caused by the latest changes. However, some test failures may not be due to the latest changes but due to non-determinism in the tests, popularly called flaky tests. The typical way to detect flaky tests is to rerun failing tests repeatedly. Unfortunately, rerunning failing tests can be costly and can slow down the development cycle.

We present the first extensive evaluation of rerunning failing tests and propose a new technique, called DeFlaker, that detects if a test failure is due to a flaky test without rerunning and with very low runtime overhead. DeFlaker monitors the coverage of latest code changes and marks as flaky any newly failing test that did not execute any of the changes. We deployed DeFlaker live, in the build process of 96 Java projects on TravisCI, and found 87 previously unknown flaky tests in 10 of these projects. We also ran experiments on project histories, where DeFlaker detected 1, 874 flaky tests from 4, 846 failures, with a low false alarm rate (1.5%). DeFlaker had a higher recall (95.5% vs. 23%) of confirmed flaky tests than Maven's default flaky test detector.

DetReduce: minimizing Android GUI test suites for regression testing

In recent years, several automated GUI testing techniques for Android apps have been proposed. These tools have been shown to be effective in achieving good test coverage and in finding bugs without human intervention. Being automated, these tools typically run for a long time (say, for several hours), either until they saturate test coverage or until a testing time budget expires. Thus, these automated tools are not good at generating concise regression test suites that could be used for testing in incremental development of the apps and in regression testing.

We propose a heuristic technique that helps create a small regression test suite for an Android app from a large test suite generated by an automated Android GUI testing tool. The key insight behind our technique is that if we can identify and remove some common forms of redundancies introduced by existing automated GUI testing tools, then we can drastically lower the time required to minimize a GUI test suite. We have implemented our algorithm in a prototype tool called DetReduce. We applied DetReduce to several Android apps and found that DetReduce reduces a test-suite by an average factor of 16.9× in size and 14.7× in running time. We also found that for a test suite generated by running SwiftHand and a randomized test generation algorithm for 8 hours, DetReduce minimizes the test suite in an average of 14.6 hours.

Time to clean your test objectives

Testing is the primary approach for detecting software defects. A major challenge faced by testers lies in crafting efficient test suites, able to detect a maximum number of bugs with manageable effort. To do so, they rely on coverage criteria, which define some precise test objectives to be covered. However, many common criteria specify a significant number of objectives that occur to be infeasible or redundant in practice, like covering dead code or semantically equal mutants. Such objectives are well-known to be harmful to the design of test suites, impacting both the efficiency and precision of the tester's effort. This work introduces a sound and scalable technique to prune out a significant part of the infeasible and redundant objectives produced by a panel of white-box criteria. In a nutshell, we reduce this task to proving the validity of logical assertions in the code under test. The technique is implemented in a tool that relies on weakest-precondition calculus and SMT solving for proving the assertions. The tool is built on top of the Frama-C verification platform, which we carefully tune for our specific scalability needs. The experiments reveal that the pruning capabilities of the tool can reduce the number of targeted test objectives in a program by up to 27% and scale to real programs of 200K lines, making it possible to automate a painstaking part of their current testing process.

Prioritizing browser environments for web application test execution

When testing client-side web applications, it is important to consider different web-browser environments. Different properties of these environments such as web-browser types and underlying platforms may cause a web application to exhibit different types of failures. As web applications evolve, they must be regression tested across these different environments. Because there are many environments to consider this process can be expensive, resulting in delayed feedback about failures in applications. In this work, we propose six techniques for providing a developer with faster feedback on failures when regression testing web applications across different web-browser environments. Our techniques draw on methods used in test case prioritization; however, in our case we prioritize web-browser environments, based on information on recent and frequent failures. We evaluated our approach using four non-trivial and popular open-source web applications. Our results show that our techniques outperform two baseline methods, namely, no ordering and random ordering, in terms of the cost-effectiveness. The improvement rates ranged from −12.24% to 39.05% for no ordering, and from −0.04% to 45.85% for random ordering.

SESSION: Empirical studies of code

An empirical study of early access games on the steam platform

"Early access" is a release strategy for software that allows consumers to purchase an unfinished version of the software. In turn, consumers can influence the software development process by giving developers early feedback. This early access model has become increasingly popular through digital distribution platforms, such as Steam which is the most popular distribution platform for games. The plethora of options offered by Steam to communicate between developers and game players contribute to the popularity of the early access model.

The early access model made a name for itself through several successful games, such as the DayZ game. The multiplayer survival-based game reached 400,000 sales during its first week as an early access game. However, the benefits of the early access model have been questioned as well. For instance, the Spacebase DF-9 game abandoned the early access stage unexpectedly, disappointing many players of the game. Shortly after abandoning the early access stage and terminating the development, twelve employees were laid off including the programmer and project lead.

In this paper [1], we conduct an empirical study on 1,182 Early Access Games (EAGs) on the Steam platform to understand the characteristics, advantages and limitations of the early access model. We find that 15% of the games on Steam make use of the early access model, with the most popular EAG having as many as 29 million owners. 88% of the EAGs are classified by their developers as so-called "indie" games, indicating that most EAGs are developed by individual developers or small studios.

We study the interaction between players and developers of EAGs and the Steam platform. We observe that on the one hand, developers update their games more frequently in the early access stage. On the other hand, the percentage of players that review a game during its early access stage is lower than the percentage of players that review the game after it leaves the early access stage. However, the average rating of the reviews is much higher during the early access stage, suggesting that players are more tolerant of imperfections in the early access stage. The positive review rate does not correlate with the length of the early access stage nor with the game update frequency during the early access stage.

In addition, we discuss several learned lessons from the failure of an early access game. The main learned lesson from this failure is that communication between the game developer and the players of the EAG is crucial. Players enjoy getting involved in the development of an early access game and they get emotionally involved in the decision-making about the game.

Based on our findings, we suggest game developers to use the early access model as a method for eliciting early feedback and more positive reviews to attract additional new players. In addition, our findings suggest that developers can determine their release schedule without worrying about the length of the early access stage and the game update frequency during the early access stage.

Correctness attraction: a study of stability of software behavior under runtime perturbation

Can the execution of software be perturbed without breaking the correctness of the output? In this paper, we devise a protocol to answer this question from a novel perspective. In an experimental study, we observe that many perturbations do not break the correctness in ten subject programs. We call this phenomenon "correctness attraction". The uniqueness of this protocol is that it considers a systematic exploration of the perturbation space as well as perfect oracles to determine the correctness of the output. To this extent, our findings on the stability of software under execution perturbations have a level of validity that has never been reported before in the scarce related work. A qualitative manual analysis enables us to set up the first taxonomy ever of the reasons behind correctness attraction.

On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation

Code smells were defined as symptoms of poor design choices applied by programmers during the development of a software project [2]. They might hinder the comprehensibility and maintainability of software systems [5]. Similarly to some previous work [3, 4, 6, 7] in this paper we investigate the relationship between the presence of code smells and the software change- and fault-proneness. Specifically, while previous work shows a significant correlation between smells and code change/fault-proneness, the empirical evidence provided so far is still limited because of:

Limited size of previous studies: The study by Khomh et al. [4] was conducted on four open source systems, while the study by D'Ambros et al. [1] was performed on seven systems. Furthermore, the studies by Li and Shatnawi [6], Olbrich et al. [7], and Gatrell and Counsell [3] were conducted considering the change history of only one software project.

Detected smells vs. manually validated smells: Previouswork studying the impact of code smells on change- and fault-proneness relied on data obtained from automatic smell detectors, whose imprecisions might have affected the results. Lack of analysis of the magnitude: Previouswork indicated that some smells can be more harmful than others, but the analysis did not take into account the magnitude of the observed phenomenon. For example, even if a specific smell type may be considered harmful when analyzing its impact on maintainability, this may not be relevant in case the number of occurrences of such a smell type in software projects is limited.

Lack of analysis of the magnitude of the effect: Previouswork indicated that classes affected by code smells have more chances to exhibit defects (or to undergo changes) than other classes. However, no study has observed the magnitude of such changes and defects, i.e., no study addressed the question: How many defects would exhibit on average a class affected by a code smell as compared to another class affected by a different kind of smell, or not affected by any smell at all?

Lack of within-artifact analysis: A class might be intrinsically change- and/or fault-prone, e.g., because it plays a core role in the system. Hence, the class may be intrinsically "smelly". Instead, there may be classes that become smelly during their lifetime because of maintenance activities. Or else, classes where the smell was removed, possibly because of refactoring activities. For such classes, it is of paramount importance to analyze the change- and fault-proneness of the class during its evolution, in order to better relate the cause (presence of smell) with the possible effect (change- or fault-proneness).

Lack of a temporal relation analysis: While previouswork correlated the presence of code smells with high fault- and changeproneness, one may wonder whether the artifact was smelly when the fault was introduced, or whether the fault was introduced before the class became smelly.

To cope with the aforementioned issues, this paper aims at corroborating previous empirical research on the impact of code smells by analyzing their diffuseness and effect on change- and faultproneness on a total of 395 releases of 30 open source systems, considering 13 different code smell types manually identified. Our results showed that classes affected by code smells tend to be significantly more change- and fault-prone than classes not affected by design problems, however their removal might be not always beneficial for improving source code maintainability.

Accurate and efficient refactoring detection in commit history

Refactoring detection algorithms have been crucial to a variety of applications: (i) empirical studies about the evolution of code, tests, and faults, (ii) tools for library API migration, (iii) improving the comprehension of changes and code reviews, etc. However, recent research has questioned the accuracy of the state-of-the-art refactoring detection tools, which poses threats to the reliability of their application. Moreover, previous refactoring detection tools are very sensitive to user-provided similarity thresholds, which further reduces their practical accuracy. In addition, their requirement to build the project versions/revisions under analysis makes them inapplicable in many real-world scenarios.

To reinvigorate a previously fruitful line of research that has stifled, we designed, implemented, and evaluated RMiner, a technique that overcomes the above limitations. At the heart of RMiner is an AST-based statement matching algorithm that determines refactoring candidates without requiring user-defined thresholds. To empirically evaluate RMiner, we created the most comprehensive oracle to date that uses triangulation to create a dataset with considerably reduced bias, representing 3,188 refactorings from 185 open-source projects. Using this oracle, we found that RMiner has a precision of 98% and recall of 87%, which is a significant improvement over the previous state-of-the-art.

SESSION: Security, privacy and trust II

ENTRUST: engineering trustworthy self-adaptive software with dynamic assurance cases

Software systems are increasingly expected to cope with variable workloads, component failures and other uncertainties through self-adaptation. As such, self-adaptive software has been the subject of intense research over the past decade [3, 4, 9, 10].

The good, the bad and the ugly: a study of security decisions in a cyber-physical systems game

Motivation: The security of any system is a direct consequence of stakeholders' decisions regarding security requirements. Such decisions are taken with varying degrees of expertise, and little is currently understood about how various demographics - security experts, general computer scientists, managers - approach security decisions and the strategies that underpin those decisions. What are the typical decision patterns, the consequences of such patterns and their impact on the security of the system in question? Nor is there any substantial understanding of how the strategies and decision patterns of these different groups contrast. Is security expertise necessarily an advantage when making security decisions in a given context? Answers to these questions are key to understanding the "how" and "why" behind security decision processes.

The Game: In this talk1, we present a tabletop game: Decisions and Disruptions (D-D)2 that tasks a group of players with managing the security of a small utility company while facing a variety of threats. The game is kept short - 2 hours - and simple enough to be played without prior training. A cyber-physical infrastructure, depicted through a Lego® board, makes the game easy to understand and accessible to players from varying backgrounds and security expertise, without being too trivial a setting for security experts.

Key insights: We played D-D with 43 players divided into homogeneous groups: 4 groups of security experts, 4 groups of nontechnical managers and 4 groups of general computer scientists.

• Strategies: Security experts had a strong interest in advanced technological solutions and tended to neglect intelligence gathering, to their own detriment. Managers, too, were technology-driven and focused on data protection while neglecting human factors more than other groups. Computer scientists tended to balance human factors and intelligence gathering with technical solutions, and achieved the best results of the three demographics.

• Decision Processes: Technical experience significantly changes the way players think. Teams with little technical experience had shallow, intuition-driven discussions with few concrete arguments. Technical teams, and the most experienced in particular, had much richer debates, driven by concrete scenarios, anecdotes from experience, and procedural thinking. Security experts showed a high confidence in their decisions - despite some of them having bad consequences - while the other groups tended to doubt their own skills - even when they were playing good games.

• Patterns: A number of characteristic plays were identified, some good (balance between priorities, open-mindedness, and adapting strategies based on inputs that challenge one's pre-conceptions), some bad (excessive focus on particular issues, confidence in charismatic leaders), some ugly ("tunnel vision" syndrome by over-confident players). These patterns are documented in the full paper - showing the virtue of the positive ones, discouraging the negative ones, and inviting the readers to do their own introspection.

Conclusion: Beyond the analysis of the security decisions of the three demographics, there is a definite educational and awareness-raising aspect to D-D (as noted consistently by players in all our subject groups). Game boxes will be brought to the conference for demonstration purposes, and the audience will be invited to experiment with D-D themselves, make their own decisions, and reflect on their own perception of security.

Lightweight, obfuscation-resilient detection and family identification of Android malware

The number of malicious Android apps has been and continues to increase rapidly. These malware can damage or alter other files or settings, install additional applications, obfuscate their behaviors, propagate quickly, and so on. To identify and handle such malware, a security analyst can significantly benefit from identifying the family to which a malicious app belongs rather than only detecting if an app is malicious. To address these challenges, we present a novel machine learning-based Android malware detection and family-identification approach, RevealDroid, that operates without the need to perform complex program analyses or extract large sets of features. RevealDroid's selected features leverage categorized Android API usage, reflection-based features, and features from native binaries of apps. We assess RevealDroid for accuracy, efficiency, and obfuscation resilience using a large dataset consisting of more than 54,000 malicious and benign apps. Our experiments show that RevealDroid achieves an accuracy of 98% in detection of malware and an accuracy of 95% in determination of their families. We further demonstrate RevealDroid's superiority against state-of-the-art approaches. [URL of original paper: https://dl.acm.org/citation.cfm?id=3162625]

Are vulnerabilities discovered and resolved like other defects?

Context: Software defect data has long been used to drive software development process improvement. If security defects (i.e.,vulnerabilities) are discovered and resolved by different software development practices than non-security defects, the knowledge of that distinction could be applied to drive process improvement.

Objective: The goal of this research is to support technical leaders in making security-specific software development process improvements by analyzing the differences between the discovery and resolution of defects versus that of vulnerabilities.

Method: We extend Orthogonal Defect Classification (ODC) [1], a scheme for classifying software defects to support software development process improvement, to study process-related differences between vulnerabilities and defects, creating ODC + Vulnerabilities (ODC+V). We applied ODC+V to classify 583 vulnerabilities and 583 defects across 133 releases of three open-source projects (Firefox, phpMyAdmin, and Chrome).

Results: Compared with defects, vulnerabilities are found later in the development cycle and are more likely to be resolved through changes to conditional logic. In Firefox, vulnerabilities are resolved 33% more quickly than defects. From a process improvement perspective, these results indicate opportunities may exist for more efficient vulnerability detection and resolution.

Figures 1 and 2 present the percentage of defects and vulnerabilities found in each Activity for Firefox and phpMyAdmin, ordered from left to right as a timeline, first by pre-release, then by postrelease. In these projects, pre-release effort in vulnerability and defect detection correlates with pre-release vulnerability and defect resolution.

Conclusion: We found ODC+V's property of associating vulnerability and defect discovery and resolution events with their software development process contexts helpful for gaining insight into three open source software projects. The addition of the Securitylmpact attribute, in particular, brought visibility into when threat types are discovered during the development process. We would expect use of ODC+V (and of base ODC) periodically over time to be helpful for steering software development projects toward their quality assurance goals.

We give our full report in Morrison et al. [2] 1

SESSION: Communities and ecosystems

How modern news aggregators help development communities shape and share knowledge

Many developers rely on modern news aggregator sites such as Reddit and Hacker News to stay up to date with the latest technological developments and trends. In order to understand what motivates developers to contribute, what kind of content is shared, and how knowledge is shaped by the community, we interviewed and surveyed developers that participate on the Reddit programming subreddit and we analyzed a sample of posts on both Reddit and Hacker News. We learned what kind of content is shared in these websites and developer motivations for posting, sharing, discussing, evaluating, and aggregating knowledge on these aggregators, while revealing challenges developers face in terms of how content and participant behavior is moderated. Our insights aim to improve the practices developers follow when using news aggregators, as well as guide tool makers on how to improve their tools. Our findings are also relevant to researchers that study developer communities of practice.

Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem

In fast-paced, reuse-heavy, and distributed software development, the transparency provided by social coding platforms like GitHub is essential to decision making. Developers infer the quality of projects using visible cues, known as signals, collected from personal profile and repository pages. We report on a large-scale, mixed-methods empirical study of npm packages that explores the emerging phenomenon of repository badges, with which maintainers signal underlying qualities about their projects to contributors and users. We investigate which qualities maintainers intend to signal and how well badges correlate with those qualities. After surveying developers, mining 294,941 repositories, and applying statistical modeling and time-series analyses, we find that non-trivial badges, which display the build status, test coverage, and up-to-dateness of dependencies, are mostly reliable signals, correlating with more tests, better pull requests, and fresher dependencies. Displaying such badges correlates with best practices, but the effects do not always persist.

"Was my contribution fairly reviewed?": a framework to study the perception of fairness in modern code reviews

Modern code reviews improve the quality of software products. Although modern code reviews rely heavily on human interactions, little is known regarding whether they are performed fairly. Fairness plays a role in any process where decisions that affect others are made. When a system is perceived to be unfair, it affects negatively the productivity and motivation of its participants. In this paper, using fairness theory we create a framework that describes how fairness affects modern code reviews. To demonstrate its applicability, and the importance of fairness in code reviews, we conducted an empirical study that asked developers of a large industrial open source ecosystem (OpenStack) what their perceptions are regarding fairness in their code reviewing process. Our study shows that, in general, the code review process in OpenStack is perceived as fair; however, a significant portion of respondents perceive it as unfair. We also show that the variability in the way they prioritize code reviews signals a lack of consistency and the existence of bias (potentially increasing the perception of unfairness). The contributions of this paper are: (1) we propose a framework---based on fairness theory---for studying and managing social behaviour in modern code reviews, (2) we provide support for the framework through the results of a case study on a large industrial-backed open source project, (3) we present evidence that fairness is an issue in the code review process of a large open source ecosystem, and, (4) we present a set of guidelines for practitioners to address unfairness in modern code reviews.

Collaborative model-driven software engineering: a classification framework and a research map

This proposal is about a study we recently published in the IEEE Transaction of Software Engineering journal [4].

Context: Collaborative software engineering (CoSE) deals with methods, processes and tools for enhancing collaboration, communication, and co-ordination (3C) among team members [5]. CoSE can be employed to conceive different kinds of artifacts during the development and evolution of software systems. For instance, when focusing on software design, multiple stakeholders with different expertise and responsibility collaborate on the system design. Model-Driven Software Engineering (MDSE) provides suitable techniques and tools for specifying, manipulating, and analyzing modeling artifacts including metamodels, models, and transformations [1]. Collaborative MDSE consists of methods or techniques in which multiple stakeholders manage, collaborate, and are aware of each others' work on a set of shared models. A collaborative MDSE approach is composed of three main complementary dimensions: (i) a model management infrastructure for managing the life cycle of the models, (ii) a set of collaboration means for allowing involved stakeholders to work on the modelling artifacts collaboratively, and (iii) a set of communication means for allowing involved stakeholders to exchange, share, and communicate information within the team. Collaborative MDSE is attracting several research efforts from different research areas (e.g., model-driven engineering, global software engineering, etc.), resulting in a variegated scientific body of knowledge on the topic.

Objective: In this study we aim at identifying, classifying, and understanding existing collaborative MDSE approaches. More specifically, our goal is to assess (i) the key characteristics of collaborative MDSE approaches (e.g., model editing environments, model versioning mechanisms, model repositories, support for communication and decision making), (ii) their faced challenges and limitations, and (iii) the interest of researchers in collaborative MDSE approaches over time and their focus on the three dimensions of collaborative MDSE.

Method: In order to achieve this, we designed and conducted a systematic mapping study [6] on collaborative MDSE. Starting from over 3,000 potentially relevant studies, we applied a rigorous selection procedure [3] resulting in 106 selected papers, further clustered into 48 primary studies, along a time span of nineteen years. A suitable classification framework has been empirically defined and rigorously applied for extracting key information from each selected study. We collated, summarized, and analyzed extracted data by applying scientifically sound data synthesis techniques.

Results: In addition to a number of specific insights, our analysis revealed the following key findings: (i) there is a growing scientific interest on collaborative MDSE in the last years; (ii) multi-view modeling, validation support, reuse, and branching are more rarely covered with respect to other aspects about collaborative MDSE; (iii) different primary studies focus differently on individual dimensions of collaborative MDSE (i.e., model management, collaboration, and communication); (iv) most approaches are language-specific, with a prominence of UML-based approaches; (v) few approaches support the interplay between synchronous and asynchronous collaboration.

Conclusion: This study gives a solid foundation for a thorough identification and comparison of existing and future approaches for collaborative MDSE [2]. Those results can be used by both researchers and practitioners for identifying existing research/technical gaps to attack, better scoping their own contributions to the field, or better understanding or refining existing ones.

SESSION: Testing I

ChangeLocator: locate crash-inducing changes based on crash reports

Software crashes are severe manifestations of software bugs. Debugging crashing bugs is tedious and time-consuming. Understanding software changes that induce a crashing bug can provide useful contextual information for bug fixing and is highly demanded by developers. Locating the bug inducing changes is also useful for automatic program repair, since it narrows down the root causes and reduces the search space of bug fix location. However, currently there are no systematic studies on locating the software changes to a source code repository that induce a crashing bug reflected by a bucket of crash reports. To tackle this problem, we first conducted an empirical study on characterizing the bug inducing changes for crashing bugs (denoted as crash-inducing changes). We also propose ChangeLocator, a method to automatically locate crash-inducing changes for a given bucket of crash reports. We base our approach on a learning model that uses features originated from our empirical study and train the model using the data from the historical fixed crashes. We evaluated ChangeLocator with six release versions of Netbeans project. The results show that it can locate the crash-inducing changes for 44.7%, 68.5%, and 74.5% of the bugs by examining only top 1, 5 and 10 changes in the recommended list, respectively. It significantly outperforms the existing state-of-the-art approach.

Are mutation scores correlated with real fault detection?: a large scale empirical study on the relationship between mutants and real faults

Empirical validation of software testing studies is increasingly relying on mutants. This practice is motivated by the strong correlation between mutant scores and real fault detection that is reported in the literature. In contrast, our study shows that correlations are the results of the confounding effects of the test suite size. In particular, we investigate the relation between two independent variables, mutation score and test suite size, with one dependent variable the detection of (real) faults. We use two data sets, CoreBench and Defects4J, with large C and Java programs and real faults and provide evidence that all correlations between mutation scores and real fault detection are weak when controlling for test suite size. We also find that both independent variables significantly influence the dependent one, with significantly better fits, but overall with relative low prediction power. By measuring the fault detection capability of the top ranked, according to mutation score, test suites (opposed to randomly selected test suites of the same size), we find that achieving higher mutation scores improves significantly the fault detection. Taken together, our data suggest that mutants provide good guidance for improving the fault detection of test suites, but their correlation with fault detection are weak.

Efficient sampling of SAT solutions for testing

In software and hardware testing, generating multiple inputs which satisfy a given set of constraints is an important problem with applications in fuzz testing and stimulus generation. However, it is a challenge to perform the sampling efficiently, while generating a diverse set of inputs which satisfy the constraints. We developed a new algorithm QuickSampler which requires a small number of solver calls to produce millions of samples which satisfy the constraints with high probability. We evaluate QuickSampler on large real-world benchmarks and show that it can produce unique valid solutions orders of magnitude faster than other state-of-the-art sampling tools, with a distribution which is reasonably close to uniform in practice.

Are fix-inducing changes a moving target?: a longitudinal case study of just-in-time defect prediction

Change-level defect prediction [5], a.k.a., Just-In-Time (JIT) defect prediction [1], is an alternative to module-level defect prediction that offers several advantages. First, since code changes are often smaller than modules (e.g., classes), JIT predictions are made at a finer granularity, which localizes the inspection process. Second, while modules have a group of authors, changes have only one, which makes triaging JIT predictions easier. Finally, unlike module level prediction, JIT models can scan changes as they are being produced, which means that problems can be investigated while design decisions are still fresh in the developers' minds.

Despite the advantages of JIT defect prediction, like all prediction models, they assume that the properties of past events (fix-inducing changes) are similar to the properties of future ones. This assumption may not hold---the properties of fix-inducing changes in one time period may be different from those of another period. In our paper [4], we set out to address the following central question:

Do the important properties of fix-inducing changes remain consistent as systems evolve?

To address our central question, we train JIT models using six families of code change properties, which are primarily derived from prior studies [1-3, 5]. These properties measure: (a) the magnitude of the change (Size); (b) the dispersion of the changes across modules (Diffusion); (c) the defect proneness of prior changes to the modified modules (History); (d) the experience of the author (Auth. Exp.) and (e) code reviewer(s) (Rev. Exp.); and (f) the amount of participation in the review of the code change (Review).

Through a longitudinal case study of 37,524 changes from the rapidly evolving Qt and OpenStack systems, we find that the answer to our central question is no:

• JIT models lose a large proportion of their discriminatory power (AUC) and calibration (Brier) scores one year after being trained.

• The magnitude of the importance scores of code change properties fluctuate as systems evolve (e.g., Figure 1 shows fluctuations across six-month periods of OpenStack).

• These fluctuations can lead to consistent overestimates (and underestimates) of the future impact of the studied families of code change properties.

To mitigate the impact on model performance, researchers and practitioners should add recently accumulated data to the training set and retrain JIT models to contain fresh data from within the last three months. To better calibrate quality improvement plans (which are based on interpretation of the importance scores of code change properties), researchers and practitioners should put a greater emphasis on larger caches of data, which contain at least six months worth of data, to smooth the effect of spikes and troughs in the importance of properties of fix-inducing changes.

SESSION: Studying software engineers I

Understanding developers' needs on deprecation as a language feature

Deprecation is a language feature that allows API producers to mark a feature as obsolete. We aim to gain a deep understanding of the needs of API producers and consumers alike regarding deprecation. To that end, we investigate why API producers deprecate features, whether they remove deprecated features, how they expect consumers to react, and what prompts an API consumer to react to deprecation. To achieve this goal we conduct semi-structured interviews with 17 third-party Java API producers and survey 170 Java developers. We observe that the current deprecation mechanism in Java and the proposal to enhance it does not address all the needs of a developer. This leads us to propose and evaluate three further enhancements to the deprecation mechanism.

On the dichotomy of debugging behavior among programmers

Debugging is an inevitable activity in most software projects, often difficult and more time-consuming than expected, giving it the nickname the "dirty little secret of computer science." Surprisingly, we have little knowledge on how software engineers debug software problems in the real world, whether they use dedicated debugging tools, and how knowledgeable they are about debugging. This study aims to shed light on these aspects by following a mixed-methods research approach. We conduct an online survey capturing how 176 developers reflect on debugging. We augment this subjective survey data with objective observations on how 458 developers use the debugger included in their integrated development environments (IDEs) by instrumenting the popular Eclipse and IntelliJ IDEs with the purpose-built plugin WatchDog 2.0. To clarify the insights and discrepancies observed in the previous steps, we followed up by conducting interviews with debugging experts and regular debugging users. Our results indicate that IDE-provided debuggers are not used as often as expected, as "printf debugging" remains a feasible choice for many programmers. Furthermore, both knowledge and use of advanced debugging features are low. These results call to strengthen hands-on debugging experience in computer science curricula and have already refined the implementation of modern IDE debuggers.

Measuring program comprehension: a large-scale field study with professionals

During software development and maintenance, developers spend a considerable amount of time on program comprehension. Previous studies show that program comprehension takes up as much as half of a developer's time. However, most of these studies are performed in a controlled setting, or with a small number of participants, and investigate the program comprehension activities only within the IDEs. However, developers' program comprehension activities go well beyond their IDE interactions.

In this paper [1], we perform a more realistic investigation of program comprehension activities. To do this, we extend our ActivitySpace framework to collect and analyze Human-Computer Interaction (HCI) data across many applications (not just the IDEs). We collect 3,148 working hour data from 78 professional developers in a field study. We follow Minelli et al.'s approach to assign developers' activities into four categories: navigation, editing, comprehension, and other. Then we measure comprehension time by calculating the time that developers spend on program comprehension. We find that on average developers spend ~ 58% of their time on program comprehension activities, and that they frequently use web browsers and document editors to perform program comprehension activities. We also investigate the impact of programming language, developers' experience, and project phase on the time that is spent on program comprehension.

Data scientists in software teams: state of the art and challenges

The demand for analyzing large scale telemetry, machine, and quality data is rapidly increasing in software industry. Data scientists are becoming popular within software teams. For example, Face-book, LinkedIn and Microsoft are creating a new career path for data scientists. In this paper, we present a large-scale survey with 793 professional data scientists at Microsoft to understand their educational background, problem topics that they work on, tool usages, and activities. We cluster these data scientists based on the time spent for various activities and identify 9 distinct clusters of data scientists and their corresponding characteristics. We also discuss the challenges that they face and the best practices they share with other data scientists. Our study finds several trends about data scientists in the software engineering context at Microsoft, and should inform managers on how to leverage data science capability effectively within their teams.

SESSION: Program analysis I

Dataflow tunneling: mining inter-request data dependencies for request-based applications

Request-based applications, e.g., most server-side applications, expose services to users in a request-based paradigm, in which requests are served by request-handler methods. An important task for request-based applications is inter-request analysis, which analyzes request-handler methods that are related by inter-request data dependencies together. However, in the request-based paradigm, data dependencies between related request-handler methods are implicitly established by the underlying frameworks that execute these methods. As a result, existing analysis tools are usually limited to the scope of each single method without the knowledge of dependencies between different methods.

In this paper, we design an approach called dataflow tunneling to capture inter-request data dependencies from concrete application executions and produce data-dependency specifications. Our approach answers two key questions: (1) what request-handler methods have data dependencies and (2) what these data dependencies are. Our evaluation using applications developed with two representative and popular frameworks shows that our approach is general and accurate. We also present a characteristic study and a use case of cache tuning based on the mined specifications. We envision that our approach can provide key information to enable future inter-request analysis techniques.

Launch-mode-aware context-sensitive activity transition analysis

Existing static analyses model activity transitions in Android apps context-insensitively, making it impossible to distinguish different activity launch modes, reducing the pointer analysis precision for an activity's callbacks, and potentially resulting in infeasible activity transition paths. In this paper, we introduce Chime, a launch-mode-aware context-sensitive activity transition analysis that models different instances of an activity class according to its launch mode and the transitions between activities context-sensitively, by working together with an object-sensitive pointer analysis.

Our evaluation shows that our context-sensitive activity transition analysis is more precise than its context-insensitive counterpart in capturing activity transitions, facilitating GUI testing, and improving the pointer analysis precision.

UFO: predictive concurrency use-after-free detection

Use-After-Free (UAF) vulnerabilities are caused by the program operating on a dangling pointer and can be exploited to compromise critical software systems. While there have been many tools to mitigate UAF vulnerabilities, UAF remains one of the most common attack vectors. UAF is particularly difficult to detect in concurrent programs, in which a UAF may only occur with rare thread schedules. In this paper, we present a novel technique, UFO, that can precisely predict UAFs based on a single observed execution trace with a provably higher detection capability than existing techniques with no false positives. The key technical advancement of UFO is an extended maximal thread causality model that captures the largest possible set of feasible traces that can be inferred from a given multithreaded execution trace. By formulating UAF detection as a constraint solving problem atop this model, we can explore a much larger thread scheduling space than classical happens-before based techniques. We have evaluated UFO on several real-world large complex C/C++ programs including Chromium and FireFox. UFO scales to real-world systems with hundreds of millions of events in their execution and has detected a large number of real concurrency UAFs.

Collective program analysis

Popularity of data-driven software engineering has led to an increasing demand on the infrastructures to support efficient execution of tasks that require deeper source code analysis. While task optimization and parallelization are the adopted solutions, other research directions are less explored. We present collective program analysis (CPA), a technique for scaling large scale source code analyses, especially those that make use of control and data flow analysis, by leveraging analysis specific similarity. Analysis specific similarity is about, whether two or more programs can be considered similar for a given analysis. The key idea of collective program analysis is to cluster programs based on analysis specific similarity, such that running the analysis on one candidate in each cluster is sufficient to produce the result for others. For determining analysis specific similarity and clustering analysis-equivalent programs, we use a sparse representation and a canonical labeling scheme. Our evaluation shows that for a variety of source code analyses on a large dataset of programs, substantial reduction in the analysis time can be achieved; on average a 69% reduction when compared to a baseline and on average a 36% reduction when compared to a prior technique. We also found that a large amount of analysis-equivalent programs exists in large datasets.

SESSION: Human and social aspects of computing II

Statistical learning of API fully qualified names in code snippets of online forums

Software developers often make use of the online forums such as StackOverflow (SO) to learn how to use software libraries and their APIs. However, the code snippets in such a forum often contain undeclared, ambiguous, or largely unqualified external references. Such declaration ambiguity and external reference ambiguity present challenges for developers in learning to correctly use the APIs. In this paper, we propose StatType, a statistical approach to resolve the fully qualified names (FQNs) for the API elements in such code snippets. Unlike existing approaches that are based on heuristics, StatType has two well-integrated factors. We first learn from a large training code corpus the FQNs that often co-occur. Then, to derive the FQN for an API name in a code snippet, we use that knowledge and also leverage the context consisting of neighboring API names. To realize those factors, we treat the problem as statistical machine translation from source code with partially qualified names to source code with FQNs of the APIs. Our empirical evaluation on real-world code and StackOverflow posts shows that StatType achieves very high accuracy with 97.6% precision and 96.7% recall, which is 16.5% relatively higher than the state-of-the-art approach.

When not to comment: questions and tradeoffs with API documentation for C++ projects

Without usable and accurate documentation of how to use an API, developers can find themselves deterred from reusing relevant code. In C++, one place developers can find documentation is in a header file. When information is missing, they may look at the corresponding implementation code. To understand what's missing from C++ API documentation and the factors influencing whether it will be fixed, we conducted a mixed-methods study involving two experience sampling surveys with hundreds of developers at the moment they visited implementation code, interviews with 18 of those developers, and interviews with 8 API maintainers. In many cases, updating documentation may provide only limited value for developers, while requiring effort maintainers don't want to invest. We identify a set of questions maintainers and tool developers should consider when improving API-level documentation.

Deuce: a lightweight user interface for structured editing

We present a structure-aware code editor, called Deuce, that is equipped with direct manipulation capabilities for invoking automated program transformations. Compared to traditional refactoring environments, DEUCE employs a direct manipulation interface that is tightly integrated within a text-based editing workflow. In particular, Deuce draws (i) clickable widgets atop the source code that allow the user to structurally select the unstructured text for subexpressions and other relevant features, and (ii) a lightweight, interactive menu of potential transformations based on the current selections. We implement and evaluate our design with mostly standard transformations in the context of a small functional programming language. A controlled user study with 21 participants demonstrates that structural selection is preferred to a more traditional text-selection interface and may be faster overall once users gain experience with the tool. These results accord with Deuce's aim to provide human-friendly structural interactions on top of familiar text-based editing.

From UI design image to GUI skeleton: a neural machine translator to bootstrap mobile GUI implementation

A GUI skeleton is the starting point for implementing a UI design image. To obtain a GUI skeleton from a UI design image, developers have to visually understand UI elements and their spatial layout in the image, and then translate this understanding into proper GUI components and their compositions. Automating this visual understanding and translation would be beneficial for bootstraping mobile GUI implementation, but it is a challenging task due to the diversity of UI designs and the complexity of GUI skeletons to generate. Existing tools are rigid as they depend on heuristically-designed visual understanding and GUI generation rules. In this paper, we present a neural machine translator that combines recent advances in computer vision and machine translation for translating a UI design image into a GUI skeleton. Our translator learns to extract visual features in UI images, encode these features' spatial layouts, and generate GUI skeletons in a unified neural network framework, without requiring manual rule development. For training our translator, we develop an automated GUI exploration method to automatically collect large-scale UI data from real-world applications. We carry out extensive experiments to evaluate the accuracy, generality and usefulness of our approach.


When testing meets code review: why and how developers review tests

Automated testing is considered an essential process for ensuring software quality. However, writing and maintaining high-quality test code is challenging and frequently considered of secondary importance. For production code, many open source and industrial software projects employ code review, a well-established software quality practice, but the question remains whether and how code review is also used for ensuring the quality of test code. The aim of this research is to answer this question and to increase our understanding of what developers think and do when it comes to reviewing test code. We conducted both quantitative and qualitative methods to analyze more than 300,000 code reviews, and interviewed 12 developers about how they review test files. This work resulted in an overview of current code reviewing practices, a set of identified obstacles limiting the review of test code, and a set of issues that developers would like to see improved in code review tools. The study reveals that reviewing test files is very different from reviewing production files, and that the navigation within the review itself is one of the main issues developers currently face. Based on our findings, we propose a series of recommendations and suggestions for the design of tools and future research.

Redefining prioritization: continuous prioritization for continuous integration

Continuous integration (CI) development environments allow software engineers to frequently integrate and test their code. While CI environments provide advantages, they also utilize non-trivial amounts of time and resources. To address this issue, researchers have adapted techniques for test case prioritization (TCP) to CI environments. To date, however, the techniques considered have operated on test suites, and have not achieved substantial improvements. Moreover, they can be inappropriate to apply when system build costs are high. In this work we explore an alternative: prioritization of commits. We use a lightweight approach based on test suite failure and execution history that is highly efficient; our approach "continuously" prioritizes commits that are waiting for execution in response to the arrival of each new commit and the completion of each previously scheduled commit. We have evaluated our approach on three non-trivial CI data sets. Our results show that our approach can be more effective than prior techniques.

MAHAKIL: diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction

This study presents MAHAKIL, a novel and efficient synthetic over-sampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with five other sampling approaches using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on robust statistical tests.

On the use of hidden Markov model to predict the time to fix bugs

A significant amount of time is spent by software developers in investigating bug reports. It is useful to indicate when a bug report will be closed, since it would help software teams to prioritise their work. Several studies have been conducted to address this problem in the past decade. Most of these studies have used the frequency of occurrence of certain developer activities as input attributes in building their prediction models. However, these approaches tend to ignore the temporal nature of the occurrence of these activities. In this paper, a novel approach using Hidden Markov models (HMMs) and temporal sequences of developer activities is proposed. The approach is empirically demonstrated in a case study using eight years of bug reports collected from the Firefox project. We provide additional details below. In a software bug repository, recorded developer activities occur sequentially. For example, activity C (a certain person has been copied on the bug report) is followed by activity A (bug confirmed and assigned to a named developer), which in turn is followed by activity Z (bug reached status resolved). Additional piece of information is developers' level of expertise, such as novice (N), intermediate (M), or experienced (E), at the time of report creation. We combine these data together to produce a sequence of temporal activities associated with bug reports in the Firefox bug repository.

SESSION: Studying software engineers II

What makes a great manager of software engineers?

Having great managers is as critical to success as having a good team or organization. A great manager is seen as fuelling the team they manage, enabling it to use its full potential. Though software engineering research studies factors that may affect the performance and productivity of software engineers and teams (like tools and skill), it has overlooked the software engineering manager. On the one hand, experts are questioning how the abundant work in management applies to software engineering. On the other hand, practitioners are looking to researchers for evidence-based guidance on how to manage software teams. We conducted a mixed methods empirical study to investigate what manager attributes developers and engineering managers perceive important and why. We present a conceptual framework of manager attributes, and find that technical skills are not the sign of greatness for an engineering manager. Through statistical analysis we identify how engineers and managers relate in their views, and how software engineering differs from other knowledge work groups.

Older adults and hackathons: a qualitative study

Globally observed trends in aging indicate that older adults constitute a growing share of the population and an increasing demographic in the modern technologies marketplace. Therefore, it has become important to address the issue of participation of older adults in the process of developing solutions suitable for their group. In this study, we approached this topic by organizing a hackathon involving teams of young programmers and older adult participants. In our paper we describe a case study of that hackathon, in which our objective was to motivate older adults to participate in software engineering processes. Based on our results from an array of qualitative methods, we propose a set of good practices that may lead to improved older adult participation in similar events and an improved process of developing apps that target older adults.

Does syntax highlighting help programming novices?

Program comprehension is an important skill for programmers - extending and debugging existing source code is part of the daily routine. Syntax highlighting is one of the most common tools used to support developers in understanding algorithms. However, most research on code highlighting is more than 20 years old, when programmers used a completely different tool chain. Newer results on the effect of syntax highlighting as used in modern Integrated Development Environments (IDEs) are inconclusive.

Do programmers work at night or during the weekend?

Abnormal working hours can reduce work health, general well-being, and productivity, independent from a profession. To inform future approaches for automatic stress and overload detection, this paper establishes empirically collected measures of the work patterns of software engineers. To this aim, we perform the first large-scale study of software engineers' working hours by investigating the time stamps of commit activities of 86 large open source software projects, both containing hired and volunteer developers. We find that two thirds of software engineers mainly follow typical office hours, empirically established to be from 10h to 18h, and do not usually work during nights and weekends. Large variations between projects and individuals exist. Surprisingly, we found no support that project maturation would decrease abnormal working hours. In the Firefox case study, we found that hired developers work more during office hours while seniority, either in terms of number of commits or job status, did not impact working hours. We conclude that the use of working hours or timestamps of work products for stress detection requires establishing baselines at the level of individuals.

SESSION: Program analysis II

Multi-granular conflict and dependency analysis in software engineering based on graph transformation

Conflict and dependency analysis (CDA) of graph transformation has been shown to be a versatile foundation for understanding interactions in many software engineering domains, including software analysis and design, model-driven engineering, and testing. In this paper, we propose a novel static CDA technique that is multi-granular in the sense that it can detect all conflicts and dependencies on multiple granularity levels. Specifically, we provide an efficient algorithm suite for computing binary, coarse-grained, and fine-grained conflicts and dependencies: Binary granularity indicates the presence or absence of conflicts and dependencies, coarse granularity focuses on root causes for conflicts and dependencies, and fine granularity shows each conflict and dependency in full detail. Doing so, we can address specific performance and usability requirements that we identified in a literature survey of CDA usage scenarios. In an experimental evaluation, our algorithm suite computes conflicts and dependencies rapidly. Finally, we present a user study, in which the participants found our coarse-grained results more understandable than the fine-grained ones reported in a state-of-the-art tool. Our overall contribution is twofold: (i) we significantly speed up the computation of fine-grained and binary CDA results and, (ii) complement them with coarse-grained ones, which offer usability benefits for numerous use cases.

Self-hiding behavior in Android apps: detection and characterization

Applications (apps) that conceal their activities are fundamentally deceptive; app marketplaces and end-users should treat such apps as suspicious. However, due to its nature and intent, activity concealing is not disclosed up-front, which puts users at risk. In this paper, we focus on characterization and detection of such techniques, e.g., hiding the app or removing traces, which we call "self hiding behavior" (SHB). SHB has not been studied per se - rather it has been reported on only as a byproduct of malware investigations. We address this gap via a study and suite of static analyses targeted at SH in Android apps. Specifically, we present (1) a detailed characterization of SHB, (2) a suite of static analyses to detect such behavior, and (3) a set of detectors that employ SHB to distinguish between benign and malicious apps. We show that SHB ranges from hiding the app's presence or activity to covering an app's traces, e.g., by blocking phone calls/text messages or removing calls and messages from logs. Using our static analysis tools on a large dataset of 9,452 Android apps (benign as well as malicious) we expose the frequency of 12 such SH behaviors. Our approach is effective: it has revealed that malicious apps employ 1.5 SHBs per app on average. Surprisingly, SH behavior is also employed by legitimate ("benign") apps, which can affect users negatively in multiple ways. When using our approach for separating malicious from benign apps, our approach has high precision and recall (combined F-measure = 87.19%). Our approach is also efficient, with analysis typically taking just 37 seconds per app. We believe that our findings and analysis tool are beneficial to both app marketplaces and end-users.

The scent of a smell: an extensive comparison between textual and structural smells

Code smells, i.e., symptoms of poor design and implementation choices applied by programmers during the development of a software project [2], represent an important factor contributing to technical debt [3]. The research community spent a lot of effort studying the extent to which code smells tend to remain in a software project for long periods of time [9], as well as their negative impact on non-functional properties of source code [4, 7]. As a consequence, several tools and techniques have been proposed to help developers in detecting code smells and to suggest refactoring opportunities (e.g., [5, 6, 8]).

So far, almost all detectors identify code smells using structural properties of source code. However, recent studies have indicated that code smells detected by existing tools are generally ignored (and thus not refactored) by the developers [1]. A possible reason is that developers do not perceive the code smells identified by the tool as actual design problems or, if they do, they are not able to practically work on such code smells. In other words, there is misalignment between what is considered smelly by the tool and what is actually refactorable by developers.

In a previous paper [6], we introduced a tool named TACO that uses textual analysis to detect code smells. The results indicated that textual and structural techniques are complementary: while some code smell instances in a software system can be correctly identified by both TACO and the alternative structural approaches, other instances can be only detected by one of the two [6].

In this paper, we investigate whether code smells detected using textual information are as difficult to identify and refactor as structural smells or if they follow a different pattern during software evolution. We firstly performed a repository mining study considering 301 releases and 183,514 commits from 20 open source projects (i) to verify whether textually and structurally detected code smells are treated differently, and (ii) to analyze their likelihood of being resolved with regards to different types of code changes, e.g., refactoring operations. Since our quantitative study cannot explain relation and causation between code smell types and maintenance activities, we perform a qualitative study with 19 industrial developers and 5 software quality experts in order to understand (i) how code smells identified using different sources of information are perceived, and (ii) whether textually or structurally detected code smells are easier to refactor. In both studies, we focused on five code smell types, i.e., Blob, Feature Envy, Long Method, Misplaced Class, and Promiscuous Package.

The results of our studies indicate that textually detected code smells are perceived as harmful as the structural ones, even though they do not exceed any typical software metrics' value (e.g., lines of code in a method). Moreover, design problems in source code affected by textual-based code smells are easier to identify and refactor. As a consequence, developers' activities tend to decrease the intensity of textual code smells, positively impacting their likelihood of being resolved. Vice versa, structural code smells typically increase in intensity over time, indicating that maintenance operations are not aimed at removing or limiting them. Indeed, while developers perceive source code affected by structural-based code smells as harmful, they face more problems in correctly identifying the actual design problems affecting these code components and/or the right refactoring operation to apply to remove them.

ConflictJS: finding and understanding conflicts between JavaScript libraries

It is a common practice for client-side web applications to build on various third-party JavaScript libraries. Due to the lack of namespaces in JavaScript, these libraries all share the same global namespace. As a result, one library may inadvertently modify or even delete the APIs of another library, causing unexpected behavior of library clients. Given the quickly increasing number of libraries, manually keeping track of such conflicts is practically impossible both for library developers and users. This paper presents ConflictJS, an automated and scalable approach to analyze libraries for conflicts. The key idea is to tackle the huge search space of possible conflicts in two phases. At first, a dynamic analysis of individual libraries identifies pairs of potentially conflicting libraries. Then, targeted test synthesis validates potential conflicts by creating a client application that suffers from a conflict. The overall approach is free of false positives, in the sense that it reports a problem only when such a client exists. We use ConflictJS to analyze and study conflicts among 951 real-world libraries. The results show that one out of four libraries is potentially conflicting and that 166 libraries are involved in at least one certain conflict. The detected conflicts cause crashes and other kinds of unexpected behavior. Our work helps library developers to prevent conflicts, library users to avoid combining conflicting libraries, and provides evidence that designing a language without explicit namespaces has undesirable effects.

SESSION: Software comprehension

Debugging data flows in reactive programs

Reactive Programming is a style of programming that provides developers with a set of abstractions that facilitate event handling and stream processing. Traditional debug tools lack support for Reactive Programming, leading developers to fallback to the most rudimentary debug tool available: logging to the console.

In this paper, we present the design and implementation of RxFiddle, a visualization and debugging tool targeted to Rx, the most popular form of Reactive Programming. RxFiddle visualizes the dependencies and structure of the data flow, as well as the data inside the flow. We evaluate RxFiddle with an experiment involving 111 developers. The results show that RxFiddle can help developers finish debugging tasks faster than with traditional debugging tools.

Do you remember this source code?

Being familiar with the source code of a program comprises knowledge about its purpose, structure, and details. Consequently, familiarity is an important factor in many contexts of software development, especially for maintenance and program comprehension. As a result, familiarity is considered to some extent in many different approaches, for example, to model costs or to identify experts. Still, all approaches we are aware of require a manual assessment of familiarity and empirical analyses of forgetting in software development are missing. In this paper, we address this issue with an empirical study that we conducted with 60 open-source developers. We used a survey to receive information on the developers' familiarity and analyze the responses based on data we extract from their used version control systems. The results show that forgetting is an important factor when considering familiarity and program comprehension of developers. We find that a forgetting curve is partly applicable for software development, investigate three factors - the number of edits, ratio of owned code, and tracking behavior - that can impact familiarity with code, and derive a general memory strength for our participants. Our findings can be used to scope approaches that have to consider familiarity and they provide insights into forgetting in the context of software development.

Inferring hierarchical motifs from execution traces

Program comprehension is a necessary step for performing many software engineering tasks. Dynamic analysis is effective in producing execution traces that assist comprehension. Traces are rich sources of information regarding the behaviour of a program. However, it is challenging to gain insight from traces due to their overwhelming amount of data and complexity. We propose a generic technique for facilitating comprehension by inferring recurring execution motifs. Inspired by bioinformatics, motifs are patterns in traces that are flexible to small changes in execution, and are captured in a hierarchical model. The hierarchical nature of the model provides an overview of the behaviour at a high-level, while preserving the execution details and intermediate levels in a structured manner. We design a visualization that allows developers to observe and interact with the model. We implement our approach in an open-source tool, called Sabalan, and evaluate it through a user experiment. The results show that using Sabalan improves developers' accuracy in performing comprehension tasks by 54%.

A comparison of program comprehension strategies by blind and sighted programmers

Programmers who are blind use a screen reader to speak source code one word at a time, as though the code were text. For example, "float f = 5.23;" can be read as "float f equals five point two three semicolon". This process of reading is in stark contrast to sighted programmers, who skim source code rapidly with their eyes. At present, it is not known whether the difference in these processes has effects on the program comprehension gained from reading code. These effects are important because they could reduce both the usefulness of accessibility tools and the generalizability of software engineering studies to persons with low vision. Furthermore, a lack of knowledge about blind programmers contributes to a bias against employing blind programmers. Employers are unfamiliar with the idea of a blind programmer and as a result may feel unsure about hiring one.

SESSION: Performance and maintenance

Identifying patch correctness in test-based program repair

Test-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches.

To reduce the number of incorrect patches generated, we propose a novel approach that heuristically determines the correctness of the generated patches. The core idea is to exploit the behavior similarity of test case executions. The passing tests on original and patched programs are likely to behave similarly while the failing tests on original and patched programs are likely to behave differently. Also, if two tests exhibit similar runtime behavior, the two tests are likely to have the same test results. Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness.

Our approach is evaluated on a dataset consisting of 139 patches generated from existing program repair systems including jGen-Prog, Nopol, jKali, ACS and HDRepair. Our approach successfully prevented 56.3% of the incorrect patches to be generated, without blocking any correct patches.

How not to structure your database-backed web applications: a study of performance bugs in the wild

Many web applications use databases for persistent data storage, and using Object Relational Mapping (ORM) frameworks is a common way to develop such database-backed web applications. Unfortunately, developing efficient ORM applications is challenging, as the ORM framework hides the underlying database query generation and execution. This problem is becoming more severe as these applications need to process an increasingly large amount of persistent data. Recent research has targeted specific aspects of performance problems in ORM applications. However, there has not been any systematic study to identify common performance anti-patterns in real-world such applications, how they affect resulting application performance, and remedies for them.

In this paper, we try to answer these questions through a comprehensive study of 12 representative real-world ORM applications. We generalize 9 ORM performance anti-patterns from more than 200 performance issues that we obtain by studying their bug-tracking systems and profiling their latest versions. To prove our point, we manually fix 64 performance issues in their latest versions and obtain a median speedup of 2× (and up to 39× max) with fewer than 5 lines of code change in most cases. Many of the issues we found have been confirmed by developers, and we have implemented ways to identify other code fragments with similar issues as well.

Speedoo: prioritizing performance optimization opportunities

Performance problems widely exist in modern software systems. Existing performance optimization techniques, including profiling-based and pattern-based techniques, usually fail to consider the architectural impacts among methods that easily slow down the overall system performance. This paper contributes a new approach, named Speedoo, to identify groups of methods that should be treated together and deserve high priorities for performance optimization. The uniqueness of Speedoo is to measure and rank the performance optimization opportunities of a method based on 1) the architectural impact and 2) the optimization potential. For each highly ranked method, we locate a respective Optimization Space based on 5 performance patterns generalized from empirical observations. The top ranked optimization spaces are suggested to developers as potential optimization opportunities. Our evaluation on three real-life projects has demonstrated that 18.52% to 42.86% of methods in the top ranked optimization spaces indeed undertook performance optimization in the projects. This outperforms one of the state-of-the-art profiling tools YourKit by 2 to 3 times. An important implication of this study is that developers should treat methods in an optimization space together as a group rather than as individuals in performance optimization. The proposed approach can provide guidelines and reduce developers' manual effort.

Empirical study on the discrepancy between performance testing results from virtual and physical environments

Large software systems often undergo performance tests to ensure their capability to handle expected loads. These performance tests often consume large amounts of computing resources and time since heavy loads need to be generated. Making it worse, the ever evolving field requires frequent updates to the performance testing environment. In practice, virtual machines (VMs) are widely exploited to provide flexible and less costly environments for performance tests. However, the use of VMs may introduce confounding overhead (e.g., a higher than expected memory utilization with unstable I/O traffic) to the testing environment and lead to unrealistic performance testing results. Yet, little research has studied the impact on test results of using VMs in performance testing activities.

To evaluate the discrepancy between the performance testing results from virtual and physical environments, we perform a case study on two open source systems - namely Dell DVD Store (DS2) and CloudStore. We conduct the same performance tests in both virtual and physical environments and compare the performance testing results based on the three aspects that are typically examined for performance testing results: 1) single performance metric (e.g. CPU Time from virtual environment vs. CPU Time from physical environment), 2) the relationship among performance metrics (e.g. correlation between CPU and I/O) and 3) performance models that are built to predict system performance. Our results show that 1) A single metric from virtual and physical environments do not follow the same distribution, hence practitioners cannot simply use a scaling factor to compare the performance between environments, 2) correlations among performance metrics in virtual environments are different from those in physical environments 3) statistical models built based on the performance metrics from virtual environments are different from the models built from physical environments suggesting that practitioners cannot use the performance testing results across virtual and physical environments. In order to assist the practitioners leverage performance testing results in both environments, we investigate ways to reduce the discrepancy. We find that such discrepancy can be reduced by normalizing performance metrics based on deviance. Overall, we suggest that practitioners should not use the performance testing results from virtual environment with the simple assumption of straightforward performance overhead. Instead, practitioners should consider leveraging normalization techniques to reduce the discrepancy before examining performance testing results from virtual and physical environments.

SESSION: Requirements and recommender systems

The evolution of requirements practices in software startups

We use Grounded Theory to study the evolution of requirements practices of 16 software startups as they grow and introduce new products and services. These startups operate in a dynamic environment, with significant time and market pressure, and rarely have time for systematic requirements analysis. Our theory describes the evolution of practice along six dimensions that emerged as relevant to their requirements activities: requirements artefacts, knowledge management, requirements-related roles, planning, technical debt and product quality. Beyond the relationships among the dimensions, our theory also explains the turning points that drove the evolution along these dimensions. These changes are reactive, rather than planned, suggesting an overall pragmatic lightness, i.e., flexibility, in the startups' evolution towards engineering practices for requirements. Our theory organises knowledge about evolving requirements practice in maturing startups, and provides practical insights for startups' assessing their own evolution as they face challenges to their growth. Our research also suggests that a startup's evolution along the six dimensions is not fundamental to its success, but has significant effects on their product, their employees and the company.

Traceability in the wild: automatically augmenting incomplete trace links

Software and systems traceability is widely accepted as an essential element for supporting many software development tasks. Today's version control systems provide inbuilt features that allow developers to tag each commit with one or more issue ID, thereby providing the building blocks from which project-wide traceability can be established between feature requests, bug fixes, commits, source code, and specific developers. However, our analysis of six open source projects showed that on average only 60% of the commits were linked to specific issues. Without these fundamental links the entire set of project-wide links will be incomplete, and therefore not trustworthy. In this paper we address the fundamental problem of missing links between commits and issues. Our approach leverages a combination of process and text-related features characterizing issues and code changes to train a classifier to identify missing issue tags in commit messages, thereby generating the missing links. We conducted a series of experiments to evaluate our approach against six open source projects and showed that it was able to effectively recommend links for tagging issues at an average of 96% recall and 33% precision. In a related task for augmenting a set of existing trace links, the classifier returned precision at levels greater than 89% in all projects and recall of 50%.

A temporal permission analysis and enforcement framework for Android

Permission-induced attacks, i.e., security breaches enabled by permission misuse, are among the most critical and frequent issues threatening the security of Android devices. By ignoring the temporal aspects of an attack during the analysis and enforcement, the state-of-the-art approaches aimed at protecting the users against such attacks are prone to have low-coverage in detection and high-disruption in prevention of permission-induced attacks. To address this shortcomings, we present Terminator, a temporal permission analysis and enforcement framework for Android. Leveraging temporal logic model checking,Terminator's analyzer identifies permission-induced threats with respect to dynamic permission states of the apps. At runtime, Terminator's enforcer selectively leases (i.e., temporarily grants) permissions to apps when the system is in a safe state, and revokes the permissions when the system moves to an unsafe state realizing the identified threats. The results of our experiments, conducted over thousands of apps, indicate that Terminator is able to provide an effective, yet non-disruptive defense against permission-induced attacks. We also show that our approach, which does not require modification to the Android framework or apps' implementation logic, is highly reliable and widely applicable.

Global-aware recommendations for repairing violations in exception handling

This paper presents an extended abstract incorporated as a journal-first paper into the ICSE'18 program.


RFC-directed differential testing of certificate validation in SSL/TLS implementations

Certificate validation in Secure Socket Layer or Transport Layer Security protocol (SSL/TLS) is critical to Internet security. Thus, it is significant to check whether certificate validation in SSL/TLS is correctly implemented. With this motivation, we propose a novel differential testing approach which is directed by the standard Request For Comments (RFC). First, rules of certificates are extracted automatically from RFCs. Second, low-level test cases are generated through dynamic symbolic execution. Third, high-level test cases, i.e. certificates, are assembled automatically. Finally, with the assembled certificates being test cases, certificate validations in SSL/TLS implementations are tested to reveal latent vulnerabilities or bugs. Our approach named RFCcert has the following advantages: (1) certificates of RFCcert are discrepancy-targeted since they are assembled according to standards instead of genetics; (2) with the obtained certificates, RFCcert not only reveals the invalidity of traditional differential testing but also is able to conduct testing that traditional differential testing cannot do; and (3) the supporting tool of RFCcert has been implemented and extensive experiments show that the approach is effective in finding bugs of SSL/TLS implementations.

Symbolic verification of regular properties

Verifying the regular properties of programs has been a significant challenge. This paper tackles this challenge by presenting symbolic regular verification (SRV) that offers significant speedups over the state-of-the-art. SRV is based on dynamic symbolic execution (DSE) and enabled by novel techniques for mitigating path explosion: (1) a regular property-oriented path slicing algorithm, and (2) a synergistic combination of property-oriented path slicing and guiding. Slicing prunes redundant paths, while guiding boosts the search for counterexamples. We have implemented SRV for Java and evaluated it on 15 real-world open-source Java programs (totaling 259K lines of code). Our evaluation results demonstrate the effectiveness and efficiency of SRV. Compared with the state-of-the-art --- pure DSE, pure guiding, and pure path slicing --- SRV achieves average speedups of more than 8.4X, 8.6X, and 7X, respectively, making symbolic regular property verification significantly more practical.

Metamorphic testing of RESTful web APIs

Web Application Programming Interfaces (APIs) specify how to access services and data over the network, typically using Web services. Web APIs are rapidly proliferating as a key element to foster reusability, integration, and innovation, enabling new consumption models such as mobile or smart TV apps. Companies such as Facebook, Twitter, Google, eBay or Netflix receive billions of API calls every day from thousands of different third-party applications and devices, which constitutes more than half of their total traffic.

As Web APIs are progressively becoming the cornerstone of software integration, their validation is getting more critical. In this context, the fast detection of bugs is of utmost importance to increase the quality of internal products and third-party applications. However, testing Web APIs is challenging mainly due to the difficulty to assess whether the output of an API call is correct, i.e., the oracle problem. For instance, consider the Web API of the popular music streaming service Spotify. Suppose a search for albums with the query "redhouse" returning 21 total matches: Is this output correct? Do all the albums in the result set contain the keyword? Are there any albums containing the keyword not included in the result set? Answering these questions is difficult, even with small result sets, and often infeasible when the results are counted by thousands or millions.

Metamorphic testing alleviates the oracle problem by providing an alternative when the expected output of a test execution is complex or unknown. Rather than checking the output of an individual program execution, metamorphic testing checks whether multiple executions of the program under test fulfil certain necessary properties called metamorphic relations. For instance, consider the following metamorphic relation in Spotify: two searches for albums with the same query should return the same number of total results regardless of the size of pagination. Suppose that a new Spotify search is performed using the exact same query as before and increasing the maximum number of results per page from 20 (default value) to 50: This search returns 27 total albums (6 more matches than in the previous search), which reveals a bug. This is an example of a real and reproducible fault detected using the approach presented in this paper and reported to Spotify. According to Spotify developers, it was a regression fault caused by a fix with undesired side effects.

In this paper [1], we present a metamorphic testing approach for the automated detection of faults in RESTful Web APIs (henceforth also referred to as simply Web APIs). We introduce the concept of metamorphic relation output patterns. A Metamorphic Relation Output Pattern (MROP) defines an abstract output relation typically identified in Web APIs, regardless of their application domain. Each MROP is defined in terms of set operations among test outputs such as equality, union, subset, or intersection. MROPs provide a helpful guide for the identification of metamorphic relations, broadening the scope of our work beyond a particular Web API. Based on the notion of MROP, a methodology is proposed for the application of the approach to any Web API following the REST architectural pattern.

The approach was evaluated in several steps. First, we used the proposed methodology to identify 33 metamorphic relations in four Web APIs developed by undergraduate students. All the relations are instances of the proposed MROPs. Then, we assessed the effectiveness of the identified relations at revealing 317 automatically seeded faults (i.e., mutants) in the APIs under test. As a result, 302 seeded faults were detected, achieving a mutation score of 95.3%. Second, we evaluated the approach using real Web APIs and faults. In particular, we identified 20 metamorphic relations in the Web API of Spotify and 40 metamorphic relations in the Web API of YouTube. Each metamorphic relation was implemented and automatically executed using both random and manual test data. In total, 469K metamorphic tests were generated. As a result, 21 metamorphic relations were violated, and 11 issues revealed and reported (3 issues in Spotify and 8 issues in YouTube). To date, 10 of the reported issues have been either confirmed by the API developers or reproduced by other users supporting the effectiveness of our approach.

Integrating technical debt management and software quality management processes: a framework and field tests

Technical debt, defined as the maintenance obligations arising from shortcuts taken during the design, development, and deployment of software systems, has been shown to significantly impact the reliability and long-term evolution of software systems [1], [2]. Although academic research has moved beyond using technical debt only as a metaphor, and has begun compiling strong empirical evidence on the economic implications of technical debt, industry practitioners continue to find managing technical debt a challenging balancing act [3]. Despite the increasing awareness of the importance of managing technical debt in software product development, systematic processes for implementing technical debt management in software production have not been readily available.

To address this gap, we developed and field tested a normative process framework that systematically incorporates steps for managing technical debt in commercial software production. The framework integrates processes required for technical debt management with existing software quality management processes prescribed by the project management body of knowledge (PMBOK) [4], and organizes the different processes for technical debt management under three steps: (1) make technical debt visible, (2) perform cost-benefit analysis, and (3) control technical debt. To implement the processes, we introduce a new artifact, called the technical debt register, which stores, for each software asset, the outstanding principal and the associated interest estimated for the technical debt embedded in the asset. The technical debt register also stores the desired control target for each software asset's technical debt, which is populated and used during the cost-benefit analysis and control target calculations.

There are three main benefits from this integrated approach. First, it enables the uncovering of hidden technical debt embedded in systems. Established quality assurance and control practices can be utilized to effectively associate software defects with specific design and deployment decisions made by programmers. Such associations make technical debt visible to the team and thereby facilitate the quantification of debt-related principal and interest. Second, it helps to bridge the gaps that exist between the technical and economic assessments of technical debt, and aid in formulating actionable policies related to technical debt management. Finally, integrating technical debt management processes with established quality frameworks aids the wider adoption of emerging prescriptions for managing technical debt.

We partnered with three commercial software product development organizations to implement the framework in real-world software production settings. All three organizations, irrespective of their varying software process maturity levels, were able to adopt the proposed framework and integrate the prescribed technical debt management processes with their existing software quality management processes. Our longitudinal data and case-study interviews indicate that the organizations were able to accrue economic benefits from the adoption and use of the integrated framework. And, based on our field study observations, we also identified a set of best practices that support the implementation and use of our framework: facilitating engagement between business and engineering stakeholders, adoption of policies based on a probabilistic analysis framework, and limiting process overheads.

SESSION: Mining software repositories

Understanding the factors for fast answers in technical Q&A websites: an empirical study of four stack exchange websites

Technical questions and answers (Q&A) websites accumulate a significant amount of knowledge from users. Developers are especially active on these Q&A websites, since developers are constantly facing new development challenges that require help from other experts. Over the years, Q&A website designers have derived several incentive systems (e.g., gamification) to encourage users to answer questions that are posted by others. However, the current incentive systems primarily focus on the quantity and quality of the answers instead of encouraging the rapid answering of questions. Improving the speed of getting an answer can significantly improve the user experience and increase user engagement on such Q&A websites.

In this paper [1], we study the factors for fast answers on such Q&A websites. Our goal is to explore how one may improve the current incentive systems to motivate fast answering of questions. We use a logistic regression model to analyze 46 factors along four dimensions (i.e., question, asker, answer, and answerer dimension) in order to understand the relationship between the studied factors and the needed time to get an accepted answer. The question dimension calculates various textual and readability features of a question, as well as the popularity and difficulty of the question's tags. The asker dimension calculates the reputation of an asker and his/her historical tendency to get answers. The answer dimension computes textual features from the text of the accepted answer. The answerer dimension computes the historical activity level of the answerer who answered the question. We conduct our study on the four most popular (i.e., with the most questions) Q&A Stack Exchange websites: Stack Overflow, Mathematics, Ask Ubuntu, and Superuser. We find that i) factors in the answerer dimension have the strongest effect on the needed time to get an accepted answer, after controlling for other factors; ii) the current incentive system does not recognize non-frequent answerers who often answer questions which frequent answerers are not able to answer well. Such questions that are answered by non-frequent answerers are as important as those that are answered by frequent answerers; iii) the current incentive system motivates frequent answerers well, but such frequent answerers tend to answer short questions.

Our findings suggest that the designers of Q&A website should improve their incentive systems to motivate non-frequent answerers to be more active and to answer questions faster, in order to shorten the waiting time for an answer (especially for questions that require specific knowledge that frequent answerers might not possess). In addition, the question answering incentive system needs to factor in the value and difficulty of answering the questions (e.g., by providing more rewards to harder questions or questions that remain unanswered for a long period of time).

Towards reusing hints from past fixes: an exploratory study on thousands of real samples

Researchers have recently proposed various automatic program repair (APR) approaches that reuse past fixes to fix new bugs. However, some fundamental questions, such as how new fixes overlap with old fixes, have not been investigated. Intuitively, the overlap between old and new fixes decides how APR approaches can construct new fixes with old ones. Based on this intuition, we systematically designed six overlap metrics, and performed an empirical study on 5,735 bug fixes to investigate the usefulness of past fixes when composing new fixes. For each bug fix, we created delta dependency graphs (i.e., program dependency graphs for code changes), and identified how bug fixes overlapped with each other in terms of the content, code structure, and identifier names of fixes. Our results show that if an APR approach composes new fixes by fully or partially reusing the content of past fixes, only 2.1% and 3.2% new fixes can be created from single or multiple past fixes in the same project, compared with 0.9% and 1.2% fixes created from past fixes across projects. However, if an APR approach composes new fixes by fully or partially reusing the code structure of past fixes, up to 41.3% and 29.7% new fixes can be created.

Are code examples on an online Q&A forum reliable?: a study of API misuse on stack overflow

Programmers often consult an online Q&A forum such as Stack Overflow to learn new APIs. This paper presents an empirical study on the prevalence and severity of API misuse on Stack Overflow. To reduce manual assessment effort, we design ExampleCheck, an API usage mining framework that extracts patterns from over 380K Java repositories on GitHub and subsequently reports potential API usage violations in Stack Overflow posts. We analyze 217,818 Stack Overflow posts using ExampleCheck and find that 31% may have potential API usage violations that could produce unexpected behavior such as program crashes and resource leaks. Such API misuse is caused by three main reasons---missing control constructs, missing or incorrect order of API calls, and incorrect guard conditions. Even the posts that are accepted as correct answers or upvoted by other programmers are not necessarily more reliable than other posts in terms of API misuse. This study result calls for a new approach to augment Stack Overflow with alternative API usage details that are not typically shown in curated examples.

Inference of development activities from interaction with uninstrumented applications

Studying developers' behavior is crucial for designing effective techniques and tools to support developers' daily work. However, there are two challenges in collecting and analyzing developers' behavior data. First, instrumenting many software tools commonly used in real work settings (e.g., IDEs, web browsers) is difficult and requires significant resources. Second, the collected behavior data consist of low-level and fine-grained event sequences, which must be abstracted into high-level development activities for further analysis.

In this paper [1], to address these two challenges, we first use our ActivitySpace framework to improve the generalizability of the data collection. Then, we propose a Condition Random Field (CRF) based approach to segment and label the developers' low-level actions into a set of basic, yet meaningful development activities. To evaluate our proposed approach, we deploy the ActivitySpace framework in an industry partner's company and collect the real working data from ten professional developers' one-week work. We conduct an experiment with the collected data and a small number of initial human-labeled training data using the CRF model and the other three baselines (i.e., a heuristic-rules based method, a SVM classifier, and a random weighted classifier). The proposed CRF model achieves better performance (i.e., 0.728 accuracy and 0.672 macro-averaged F1-score) than the other three baselines.

SESSION: Models and modeling I

Propagating configuration decisions with modal implication graphs

Highly-configurable systems encompass thousands of interdependent configuration options, which require a non-trivial configuration process. Decision propagation enables a backtracking-free configuration process by computing values implied by user decisions. However, employing decision propagation for large-scale systems is a time-consuming task and, thus, can be a bottleneck in interactive configuration processes and analyses alike. We propose modal implication graphs to improve the performance of decision propagation by precomputing intermediate values used in the process. Our evaluation results show a significant improvement over state-of-the-art algorithms for 120 real-world systems.

A combinatorial approach for exposing off-nominal behaviors

Off-nominal behaviors (ONBs) have been a major concern in the areas of embedded systems and safety-critical systems. To address ONB problems, some researchers have proposed model-based approaches that can expose ONBs by analyzing natural language requirements documents. While these approaches produced promising results, they require a lot of human effort and time. In this paper, to reduce human effort and time, we propose a combinatorial-based approach, Combinatorial Causal Component Model (Combi-CCM), which uses structured requirements patterns and combinations generated using the IPOG algorithm. We conducted an empirical study using several requirements documents to evaluate our approach, and our results indicate that the proposed approach can reduce human effort and time while maintaining the same ONB exposure ability obtained by the control techniques.

Identifying design problems in the source code: a grounded theory

The prevalence of design problems may cause re-engineering or even discontinuation of the system. Due to missing, informal or outdated design documentation, developers often have to rely on the source code to identify design problems. Therefore, developers have to analyze different symptoms that manifest in several code elements, which may quickly turn into a complex task. Although researchers have been investigating techniques to help developers in identifying design problems, there is little knowledge on how developers actually proceed to identify design problems. In order to tackle this problem, we conducted a multi-trial industrial experiment with professionals from 5 software companies to build a grounded theory. The resulting theory offers explanations on how developers identify design problems in practice. For instance, it reveals the characteristics of symptoms that developers consider helpful. Moreover, developers often combine different types of symptoms to identify a single design problem. This knowledge serves as a basis to further understand the phenomena and advance towards more effective identification techniques.

Predicting future developer behavior in the IDE using topic models

Interaction data, gathered from developers' daily clicks and key presses in the IDE, has found use in both empirical studies and in recommendation systems for software engineering. We observe that this data has several characteristics, common across IDEs:

• exponentially distributed - some events or commands dominate the trace (e.g., cursor movement commands), while most other commands occur relatively infrequently.

• noisy - the traces include spurious commands (or clicks), or unrelated events, that may not be important to the behavior of interest.

• comprise of overlapping events and commands - specific commands can be invoked by separate mechanisms, and similar events can be triggered by different sources.

SESSION: Code search, synthesis, performance

Deep code search

To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled.

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.

Augmenting and structuring user queries to support efficient free-form code search

Motivation: Code search is an important activity in software development since developers are regularly searching [6] for code examples dealing with diverse programming concepts, APIs, and specific platform peculiarities. To help developers search for source code, several Internet-scale code search engines, such as OpenHub [5] and Codota [1] have been proposed. Unfortunately, these Internet-scale code search engines have limited performance since they treat source code as natural language documents. To improve the performance of search engines, the construction of the search space index as well as the mapping process of querying must address the challenge that "no single word can be chosen to describe a programming concept in the best way" [2]. This is known in the literature as the vocabulary mismatch problem [3].

Approach: We propose a novel approach to augmenting user queries in a free-form code search scenario. This approach aims at improving the quality of code examples returned by Internet-scale code search engines by building a Code voCaBulary (CoCaBu) [7]. The originality of CoCaBu is that it addresses the vocabulary mismatch problem, by expanding/enriching/re-targeting a user's free-form query, building on similar questions in Q&A sites so that a code search engine can find highly relevant code in source code repositories. Figure 1 provides an overview of our approach.

The search process begins with a free-form query from a user,

i.e., a sentence written in a natural language:

(a) For a given query, CoCaBu first searches for relevant posts in Q&A forums. The role of the Search Proxy is then to forward developer free-form queries to web search engines that can collect and rank entries in Q&A with the most relevant documents for the query.

(b) CoCaBu then generates an augmented query based on the information in the relevant posts. It mainly leverages code snippets in the previously identified posts. The Code Query Generator then creates another query which includes not only the initial user query terms but also program elements. To accelerate this step in the search process, CoCaBu builds upfront a snippet index for Q&A posts.

(c) Once the augmented query is constructed, CoCaBu searches source files for code locations that match the query terms. For this step, we crawl a large number of repositories and build upfront a code index of program elements in the source code.


• CoCaBu approach to the vocabulary mismatch problem: We propose a technique for finding relevant code with freeform query terms that describe programming tasks, with no a-priori knowledge on the API keywords to search for.

• GitSearch free-form search engine for GitHub: We instantiate the CoCaBu approach based on indices of Java files built from GitHub and Q&A posts from Stack Overflow to find the most relevant code examples for developer queries.

Empirical user evaluation: Comparison with popular code search engines further shows that GitSearch is more effective in returning acceptable code search results. In addition, Comparison against web search engines indicates that GitSearch is a competitive alternative. Finally, via a live study, we show that users on Q&A sites may find GitSearch's real code examples acceptable as answers to developer questions.

Concluding remarks: As a follow-up work, we have also leveraged Stack Overflow data to build a practical, novel, and efficient code-to-code search engine [4].

FaCoY: a code-to-code search engine

Code search is an unavoidable activity in software development. Various approaches and techniques have been explored in the literature to support code search tasks. Most of these approaches focus on serving user queries provided as natural language free-form input. However, there exists a wide range of use-case scenarios where a code-to-code approach would be most beneficial. For example, research directions in code transplantation, code diversity, patch recommendation can leverage a code-to-code search engine to find essential ingredients for their techniques. In this paper, we propose FaCoY, a novel approach for statically finding code fragments which may be semantically similar to user input code. FaCoY implements a query alternation strategy: instead of directly matching code query tokens with code in the search space, FaCoY first attempts to identify other tokens which may also be relevant in implementing the functional behavior of the input code. With various experiments, we show that (1) FaCoY is more effective than online code-to-code search engines; (2) FaCoY can detect more semantic code clones (i.e., Type-4) in BigCloneBench than the state-of-the-art; (3) FaCoY, while static, can detect code fragments which are indeed similar with respect to runtime execution behavior; and (4) FaCoY can be useful in code/patch recommendation.

Generalized data structure synthesis

Data structure synthesis is the task of generating data structure implementations from high-level specifications. Recent work in this area has shown potential to save programmer time and reduce the risk of defects. Existing techniques focus on data structures for manipulating subsets of a single collection, but real-world programs often track multiple related collections and aggregate properties such as sums, counts, minimums, and maximums.

This paper shows how to synthesize data structures that track subsets and aggregations of multiple related collections. Our technique decomposes the synthesis task into alternating steps of query synthesis and incrementalization. The query synthesis step implements pure operations over the data structure state by leveraging existing enumerative synthesis techniques, specialized to the data structures domain. The incrementalization step implements imperative state modifications by re-framing them as fresh queries that determine what to change, coupled with a small amount of code to apply the change. As an added benefit of this approach over previous work, the synthesized data structure is optimized for not only the queries in the specification but also the required update operations. We have evaluated our approach in four large case studies, demonstrating that these extensions are broadly applicable.

SESSION: Software tools and environments

A graph solver for the automated generation of consistent domain-specific models

Many testing and benchmarking scenarios in software and systems engineering depend on the systematic generation of graph models. For instance, tool qualification necessitated by safety standards would require a large set of consistent (well-formed or malformed) instance models specific to a domain. However, automatically generating consistent graph models which comply with a metamodel and satisfy all well-formedness constraints of industrial domains is a significant challenge. Existing solutions which map graph models into first-order logic specification to use back-end logic solvers (like Alloy or Z3) have severe scalability issues. In the paper, we propose a graph solver framework for the automated generation of consistent domain-specific instance models which operates directly over graphs by combining advanced techniques such as refinement of partial models, shape analysis, incremental graph query evaluation, and rule-based design space exploration to provide a more efficient guidance. Our initial performance evaluation carried out in four domains demonstrates that our approach is able to generate models which are 1-2 orders of magnitude larger (with 500 to 6000 objects!) compared to mapping-based approaches natively using Alloy.

Automatically finding bugs in a commercial cyber-physical system development tool chain with SLforge

Cyber-physical system (CPS) development tool chains are widely used in the design, simulation, and verification of CPS data-flow models. Commercial CPS tool chains such as MathWorks' Simulink generate artifacts such as code binaries that are widely deployed in embedded systems. Hardening such tool chains by testing is crucial since formally verifying them is currently infeasible. Existing differential testing frameworks such as CyFuzz can not generate models rich in language features, partly because these tool chains do not leverage the available informal Simulink specifications. Furthermore, no study of existing Simulink models is available, which could guide CyFuzz to generate realistic models.

To address these shortcomings, we created the first large collection of public Simulink models and used the collected models' properties to guide random model generation. To further guide model generation we systematically collected semi-formal Simulink specifications. In our experiments on several hundred models, the resulting SLforge generator was more effective and efficient than the state-of-the-art tool CyFuzz. SLforge also found 8 new confirmed bugs in Simulink.

Context-aware conversational developer assistants

Building and maintaining modern software systems requires developers to perform a variety of tasks that span various tools and information sources. The crosscutting nature of these development tasks requires developers to maintain complex mental models and forces them (a) to manually split their high-level tasks into low-level commands that are supported by the various tools, and (b) to (re)establish their current context in each tool. In this paper we present Devy, a Conversational Developer Assistant (CDA) that enables developers to focus on their high-level development tasks. Devy reduces the number of manual, often complex, low-level commands that developers need to perform, freeing them to focus on their high-level tasks. Specifically, Devy infers high-level intent from developer's voice commands and combines this with an automatically-generated context model to determine appropriate workflows for invoking low-level tool actions; where needed, Devy can also prompt the developer for additional information. Through a mixed methods evaluation with 21 industrial developers, we found that Devy provided an intuitive interface that was able to support many development tasks while helping developers stay focused within their development environment. While industrial developers were largely supportive of the automation Devy enabled, they also provided insights into several other tasks and workflows CDAs could support to enable them to better focus on the important parts of their development tasks.

Open source barriers to entry, revisited: a sociotechnical perspective

Research has revealed that significant barriers exist when entering Open-Source Software (OSS) communities and that women disproportionately experience such barriers. However, this research has focused mainly on social/cultural factors, ignoring the environment itself --- the tools and infrastructure. To shed some light onto how tools and infrastructure might somehow factor into OSS barriers to entry, we conducted a field study with five teams of software professionals, who worked through five use-cases to analyze the tools and infrastructure used in their OSS projects. These software professionals found tool/infrastructure barriers in 7% to 71% of the use-case steps that they analyzed, most of which are tied to newcomer barriers that have been established in the literature. Further, over 80% of the barrier types they found include attributes that are biased against women.

SESSION: Search-based software engineering I

Testing vision-based control systems using learnable evolutionary algorithms

Vision-based control systems are key enablers of many autonomous vehicular systems, including self-driving cars. Testing such systems is complicated by complex and multidimensional input spaces. We propose an automated testing algorithm that builds on learnable evolutionary algorithms. These algorithms rely on machine learning or a combination of machine learning and Darwinian genetic operators to guide the generation of new solutions (test scenarios in our context). Our approach combines multiobjective population-based search algorithms and decision tree classification models to achieve the following goals: First, classification models guide the search-based generation of tests faster towards critical test scenarios (i.e., test scenarios leading to failures). Second, search algorithms refine classification models so that the models can accurately characterize critical regions (i.e., the regions of a test input space that are likely to contain most critical test scenarios). Our evaluation performed on an industrial automotive automotive system shows that: (1) Our algorithm outperforms a baseline evolutionary search algorithm and generates 78% more distinct, critical test scenarios compared to the baseline algorithm. (2) Our algorithm accurately characterizes critical regions of the system under test, thus identifying the conditions that are likely to lead to system failures.

To preserve or not to preserve invalid solutions in search-based software engineering: a case study in software product lines

Multi-objective evolutionary algorithms (MOEAs) have been successfully applied for software product lines (SPLs) to search for optimal or near-optimal solutions that balance multiple objectives. However, MOEAs usually produce invalid solutions that violate the constraints predefined. As invalid solutions are unbuildable in practice, we debate the preservation of invalid solutions during the search. We conduct experiments on seven real-world SPLs, including five largest SPLs hitherto reported and two SPLs with realistic values and constraints of quality attributes. We identify three potential limitations of preserving invalid solutions. Furthermore, based on the state-of-the-art, we design five algorithm variants that adopt different evolutionary operators. By performance evaluation, we provide empirical guidance on how to preserve valid solutions. Our empirical study demonstrates that whether or not to preserve invalid solutions deserves more attention in the community, and in some cases, we have to preserve valid solutions all along the way.

Nemo: multi-criteria test-suite minimization with integer nonlinear programming

Multi-criteria test-suite minimization aims to remove redundant test cases from a test suite based on some criteria such as code coverage, while trying to optimally maintain the capability of the reduced suite based on other criteria such as fault-detection effectiveness. Existing techniques addressing this problem with integer linear programming claim to produce optimal solutions. However, the multi-criteria test-suite minimization problem is inherently nonlinear, due to the fact that test cases are often dependent on each other in terms of test-case criteria. In this paper, we propose a framework that formulates the multi-criteria test-suite minimization problem as an integer nonlinear programming problem. To solve this problem optimally, we programmatically transform this nonlinear problem into a linear one and then solve the problem using modern linear solvers. We have implemented our framework as a tool, called Nemo, that supports a number of modern linear and nonlinear solvers. We have evaluated Nemo with a publicly available dataset and minimization problems involving multiple criteria including statement coverage, fault-revealing capability, and test execution time. The experimental results show that Nemo can be used to efficiently find an optimal solution for multi-criteria test-suite minimization problems with modern solvers, and the optimal solutions outperform the suboptimal ones by up to 164.29% in terms of the criteria considered in the problem.

Is "better data" better than "better data miners"?: on the benefits of tuning SMOTE for defect prediction

We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and they did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-performance criteria while (b) fixing the weaker regions of the training data (using SMOTUNED, which is an auto-tuning version of SMOTE). This approach leads to dramatically large increases in software defect predictions when applied in a 5*5 cross-validation study for 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict for defects. Same kind of pattern (improvement) was observed when a comparative analysis of SMOTE and SMOTUNED was done against the most recent class imbalance technique.

In conclusion, for software analytic tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.


Analyzing the effects of test driven development in GitHub

Testing is an integral part of the software development lifecycle, approached with varying degrees of rigor by different process models. Agile process models recommend Test Driven Development (TDD) as a key practice for reducing costs and improving code quality. The objective of this work is to perform a cost-benefit analysis of this practice. Previous work by Fucci et al. [2, 3] engaged in laboratory studies of developers actively engaged in test-driven development practices. Fucci et al. found little difference between test-first behaviour of TDD and test-later behaviour. To that end, we opted to conduct a study about TDD behaviours in the "wild" rather than in the laboratory. Thus we have conducted a comparative analysis of GitHub repositories that adopts TDD to a lesser or greater extent, in order to determine how TDD affects software development productivity and software quality. We classified GitHub repositories archived in 2015 in terms of how rigorously they practiced TDD, thus creating a TDD spectrum. We then matched and compared various subsets of these repositories on this TDD spectrum with control sets of equal size. The control sets were samples from all GitHub repositories that matched certain characteristics, and that contained at least one test file. We compared how the TDD sets differed from the control sets on the following characteristics: number of test files, average commit velocity, number of bug-referencing commits, number of issues recorded, usage of continuous integration, number of pull requests, and distribution of commits per author. We found that Java TDD projects were relatively rare. In addition, there were very few significant differences in any of the metrics we used to compare TDD-like and non-TDD projects; therefore, our results do not reveal any observable benefits from using TDD.

A comparative study to benchmark cross-project defect prediction approaches

Cross-Project Defect Prediction (CPDP) as a means to focus quality assurance of software projects was under heavy investigation in recent years. However, within the current state-of-the-art it is unclear which of the many proposals performs best due to a lack of replication of results and diverse experiment setups that utilize different performance metrics and are based on different underlying data. Within this article [2, 3], we provide a benchmark for CPDP. Our benchmark replicates 24 CPDP approaches proposed by researchers between 2008 and 2015. Through our benchmark, we answer the following research questions:

RQ1: Which CPDP approaches perform best in terms of F-measure, G-measure, AUC, and MCC?

RQ2: Does any CPDP approach consistently fulfill the performance criteria for successful predictions postulated by Zimmermann et al. [4], i.e., have at least 0.75 recall, 0.75 precision, and 0.75 accuracy?

RQ3: What is the impact of using only larger products (> 100 instances) with a certain balance (at least 5% defective instances and at least 5% non-defective instances) on the benchmark results?

RQ4: What is the impact of using a relatively small subset of a larger data set on the benchmark results?

We identified 5 public data sets, which contain defect data about 86 software products that we used to answer these research question. The advantage of using multiple data sets was that we could increase the number of software products and, thereby, increase the external validity of our results. Moreover, we wanted to use multiple performance criteria for the evaluation of the CPDP approaches. Therefore, RQ1 ranks approaches not just using a single criterion, but using the four performance metrics AUC, F-measure, G-measure, and MCC. Existing approaches for the ranking of statistically different approaches neither account for software products from different data sets, nor multiple performance metrics. Therefore, we defined a new approach for the combination of separate rankings for the performance criteria and data sets, into one common ranking.

Figure 1 depicts the results for RQ1. The results show that an approach proposed by Camargo Cruz and Ochimizu [1] performs best and even outperforms cross-validation. Moreover, our results show that only 6 of the 24 approaches outperform one of our baselines, i.e., using all data for training without any transfer learning. Regarding RQ2, we determined that predictions only seldomly achieve a high performance of 0.75 recall, precision, and accuracy. The best CPDP approaches only fulfill the criterion for 4 of the 86 products, i.e., 4.6% of the time. Thus, CPDP still has not reached a point where the performance of the results is sufficient for the application in practice.

RQ3 and RQ4 were used to see if results are affected by subsetting data, as is often done for defect prediction experiments. For RQ3, i.e., using a large subset, we determined no difference between using all data and using the subset. For RQ4, i.e., using a small subset of of data, we found that there are statistically signifcant differences in reported performances of up to 5%. Thus, the use of small subsets should be avoided.

MSeer: an advanced technique for locating multiple bugs in parallel

In practice, a program may contain multiple bugs. The simultaneous presence of these bugs may deteriorate the effectiveness of existing fault-localization techniques to locate program bugs. While it is acceptable to use all failed and successful tests to identify suspicious code for programs with exactly one bug, it is not appropriate to use the same approach for programs with multiple bugs because the due-to relationship between failed tests and underlying bugs cannot be easily identified. One solution is to generate fault-focused clusters by grouping failed tests caused by the same bug into the same clusters. We propose MSeer - an advanced fault localization technique for locating multiple bugs in parallel. Our major contributions include the use of (1) a revised Kendall tau distance to measure the distance between two failed tests, (2) an innovative approach to simultaneously estimate the number of clusters and assign initial medoids to these clusters, and (3) an improved K-medoids clustering algorithm to better identify the due-to relationship between failed tests and their corresponding bugs. Case studies on 840 multiple-bug versions of seven programs suggest that MSeer performs better in terms of effectiveness and efficiency than two other techniques for locating multiple bugs in parallel.

Journal first presentation of an experience report on applying software testing academic results in industry: we need usable automated test generation

What is the impact of software engineering research on current practices in industry? In this paper, I report on my direct experience as a PhD/post-doc working in software engineering research projects, and then spending the following five years as an engineer in two different companies (the first one being the same I worked in collaboration with during my post-doc). Given a background in software engineering research, what cutting-edge techniques and tools from academia did I use in my daily work when developing and testing the systems of these companies? Regarding validation and verification (my main area of research), the answer is rather short: as far as I can tell, only FindBugs. In this paper, I report on why this was the case, and discuss all the challenging, complex open problems we face in industry and which somehow are "neglected" in the academic circles. In particular, I will first discuss what actual tools I could use in my daily work, such as JaCoCo and Selenium. Then, I will discuss the main open problems I faced, particularly related to environment simulators, unit and web testing. After that, popular topics in academia are presented, such as UML, regression and mutation testing. Their lack of impact on the type of projects I worked on in industry is then discussed. Finally, from this industrial experience, I provide my opinions about how this situation can be improved, in particular related to how academics are evaluated, and advocate for a greater involvement into open-source projects.

SESSION: Software evolution and maintenance II

CCAligner: a token based large-gap clone detector

Copying code and then pasting with large number of edits is a common activity in software development, and the pasted code is a kind of complicated Type-3 clone. Due to large number of edits, we consider the clone as a large-gap clone. Large-gap clone can reflect the extension of code, such as change and improvement. The existing state-of-the-art clone detectors suffer from several limitations in detecting large-gap clones. In this paper, we propose a tool, CCAligner, using code window that considers e edit distance for matching to detect large-gap clones. In our approach, a novel e-mismatch index is designed and the asymmetric similarity coefficient is used for similarity measure. We thoroughly evaluate CCAligner both for large-gap clone detection, and for general Type-1, Type-2 and Type-3 clone detection. The results show that CCAligner performs better than other competing tools in large-gap clone detection, and has the best execution time for 10MLOC input with good precision and recall in general Type-1 to Type-3 clone detection. Compared with existing state-of-the-art tools, CCAligner is the best performing large-gap clone detection tool, and remains competitive with the best clone detectors in general Type-1, Type-2 and Type-3 clone detection.

HireBuild: an automatic approach to history-driven repair of build scripts

Advancements in software build tools such as Maven reduce build management effort, but developers still need specialized knowledge and long time to maintain build scripts and resolve build failures. More recent build tools such as Gradle give developers greater extent of customization flexibility, but can be even more difficult to maintain. According to the TravisTorrent dataset of open-source software continuous integration, 22% of code commits include changes in build script files to maintain build scripts or to resolve build failures. Automated program repair techniques have great potential to reduce cost of resolving software failures, but the existing techniques mostly focus on repairing source code so that they cannot directly help resolving software build failures. To address this limitation, we propose HireBuild: <u>Hi</u>story-Driven <u>Rep</u>air of <u>Build</u> Scripts, the first approach to automatic patch generation for build scripts, using fix patterns automatically generated from existing build script fixes and recommending fix patterns based on build log similarity. From TravisTorrent dataset, we extracted 175 build failures and their corresponding fixes which revise Gradle build scripts. Among these 175 build failures, we used the 135 earlier build fixes for automatic fix-pattern generation and the more recent 40 build failures (fixes) for evaluation of our approach. Our experiment shows that our approach can fix 11 of 24 reproducible build failures, or 45% of the reproducible build failures, within comparable time of manual fixes.

The road to live programming: insights from the practice

Live Programming environments allow programmers to get feedback instantly while changing software. Liveness is gaining attention among industrial and open-source communities; several IDEs offer high degrees of liveness. While several studies looked at how programmers work during software evolution tasks, none of them consider live environments. We conduct such a study based on an analysis of 17 programming sessions of practitioners using Pharo, a mature Live Programming environment. The study is complemented by a survey and subsequent analysis of 16 programming sessions in additional languages, e.g., JavaScript. We document the approaches taken by developers during their work. We find that some liveness features are extensively used, and have an impact on the way developers navigate source code and objects in their work.

Assessing the threat of untracked changes in software evolution

While refactoring is extensively performed by practitioners, many Mining Software Repositories (MSR) approaches do not detect nor keep track of refactorings when performing source code evolution analysis. In the best case, keeping track of refactorings could be unnecessary work; in the worst case, these untracked changes could significantly affect the performance of MSR approaches. Since the extent of the threat is unknown, the goal of this paper is to assess whether it is significant. Based on an extensive empirical study, we answer positively: we found that between 10 and 21% of changes at the method level in 15 large Java systems are untracked. This results in a large proportion (25%) of entities that may have their histories split by these changes, and a measurable effect on at least two MSR approaches. We conclude that handling untracked changes should be systematically considered by MSR studies.

SESSION: Models and modeling II

Programming not only by example

Recent years have seen great progress in automated synthesis techniques that can automatically generate code based on some intent expressed by the programmer, but communicating this intent remains a major challenge. When the expressed intent is coarse-grained (for example, restriction on the expected type of an expression), the synthesizer often produces a long list of results for the programmer to choose from, shifting the heavy-lifting to the user. An alternative approach, successfully used in end-user synthesis, is programming by example (PBE), where the user leverages examples to interactively and iteratively refine the intent. However, using only examples is not expressive enough for programmers, who can observe the generated program and refine the intent by directly relating to parts of the generated program.

We present a novel approach to interacting with a synthesizer using a granular interaction model. Our approach employs a rich interaction model where (i) the synthesizer decorates a candidate program with debug information that assists in understanding the program and identifying good or bad parts, and (ii) the user is allowed to provide feedback not only on the expected output of a program but also on the program itself. After identifying a program as (partially) correct or incorrect, the user can also explicitly indicate the good or bad parts, to allow the synthesizer to accept or discard parts of the program instead of discarding the program as a whole.

We show the value of our approach in a controlled user study. Our study shows that participants have a strong preference for granular feedback instead of examples and can provide granular feedback much faster.

Goal-conflict likelihood assessment based on model counting

In goal-oriented requirements engineering approaches, conflict analysis has been proposed as an abstraction for risk analysis. Intuitively, given a set of expected goals to be achieved by the system-to-be, a conflict represents a subtle situation that makes goals diverge, i.e., not be satisfiable as a whole. Conflict analysis is typically driven by the identify-assess-control cycle, aimed at identifying, assessing and resolving conflicts that may obstruct the satisfaction of the expected goals. In particular, the assessment step is concerned with evaluating how likely the identified conflicts are, and how likely and severe are their consequences.

So far, existing assessment approaches restrict their analysis to obstacles (conflicts that prevent the satisfaction of a single goal), and assume that certain probabilistic information on the domain is provided, that needs to be previously elicited from experienced users, statistical data or simulations. In this paper, we present a novel automated approach to assess how likely a conflict is, that applies to general conflicts (not only obstacles) without requiring probabilistic information on the domain. Intuitively, given the LTL formulation of the domain and of a set of goals to be achieved, we compute goal conflicts, and exploit string model counting techniques to estimate the likelihood of the occurrence of the corresponding conflicting situations and the severity in which these affect the satisfaction of the goals. This information can then be used to prioritize conflicts to be resolved, and suggest which goals to drive attention to for refinements.

A posteriori typing for model-driven engineering: concepts, analysis, and applications

Model-Driven Engineering (MDE) is a software engineering paradigm where models are actively used to specify, test, simulate, analyse and maintain the systems to be built, among other activities. Models can be defined using general-purpose modelling languages like the UML, but for particular domains, the use of domain-specific languages is pervasive. Either way, models must conform to a meta-model which defines their abstract syntax.

In MDE, the definition of model management operations - often typed over project-specific meta-models - is recurrent. However, even if two operations are similar, they must be developed from scratch whenever they are applied to instances of different meta-models. This is so as operations defined (i.e., typed) over a meta-model cannot be directly reused for another. Part of this difficulty of reuse is because classes in meta-models are used in two ways: as templates to create objects and as static classifiers for them. These two aspects are inherently tied in most meta-modelling approaches, which results in unnecessarily rigid systems and hinders reusability of MDE artefacts.

To enhance flexibility and reuse in MDE, we propose an approach to decouple object creation from typing [1]. The approach relies on standard mechanisms for object creation, and proposes the notion of a posteriori typing as a means to retype objects and enable multiple, partial, dynamic typings.

A posteriori typing enhances flexibility because it allows models to be retyped with respect to other meta-models. Hence, we distinguish between creation meta-models used to construct models, and role meta-models into which models are retyped. This permits unanticipated reuse, as a model management operation defined for a role meta-model can be reused as-is with models built using a different creation meta-model, once such models are reclassified. Moreover our approach permits expressing some types of bidirectional model transformations by reclassification. The transformations defined as reclassifications have better performance than the equivalent ones defined with traditional transformation languages, because reclassification does not require creating new objects.

In [1], we propose two mechanisms to define a posteriori typings: type-level (mappings between meta-models) and instance-level (set of model queries). The paper presents the underlying theory and type correctness criteria of both mechanisms, defines some analysis methods, identifies practical restrictions for retyping specifications, and demonstrates the feasibility of the approach by an implementation atop our meta-modelling tool MetaDepth. We also explore application scenarios of a posteriori typing (to define transformations, for model transformation reuse, and to improve transformation expressiveness by dynamic type change), and present an experiment showing the potential performance gains when expressing transformations as retypings.

The tool is available at http://metadepth.org/. A catalogue of transformations expressed as retypings, and retypings bridging recurring meta-model heterogeneities, are available at http://miso.es/aposteriori/.

A static verification framework for message passing in Go using behavioural types

The Go programming language has been heavily adopted in industry as a language that efficiently combines systems programming with concurrency. Go's concurrency primitives, inspired by process calculi such as CCS and CSP, feature channel-based communication and lightweight threads, providing a distinct means of structuring concurrent software. Despite its popularity, the Go programming ecosystem offers little to no support for guaranteeing the correctness of message-passing concurrent programs.

This work proposes a practical verification framework for message passing concurrency in Go by developing a robust static analysis that infers an abstract model of a program's communication behaviour in the form of a behavioural type, a powerful process calculi typing discipline. We make use of our analysis to deploy a model and termination checking based verification of the inferred behavioural type that is suitable for a range of safety and liveness properties of Go programs, providing several improvements over existing approaches. We evaluate our framework and its implementation on publicly available real-world Go code.

SESSION: Inference and invariants

Inferring and asserting distributed system invariants

Distributed systems are difficult to debug and understand. A key reason for this is distributed state, which is not easily accessible and must be pieced together from the states of the individual nodes in the system.

We propose Dinv, an automatic approach to help developers of distributed systems uncover the runtime distributed state properties of their systems. Dinv uses static and dynamic program analyses to infer relations between variables at different nodes. For example, in a leader election algorithm, Dinv can relate the variable leader at different nodes to derive the invariant ∀ nodes i, j, leaderi = leaderj. This can increase the developer's confidence in the correctness of their system. The developer can also use Dinv to convert an inferred invariant into a distributed runtime assertion on distributed state.

We applied Dinv to several popular distributed systems, such as etcd Raft, Hashicorp Serf, and Taipei-Torrent, which have between 1.7K and 144K LOC and are widely used. Dinv derived useful invariants for these systems, including invariants that capture the correctness of distributed routing strategies, leadership, and key hash distribution. We also used Dinv to assert correctness of the inferred etcd Raft invariants at runtime, using these asserts to detect injected silent bugs.

DroidStar: callback typestates for Android classes

Event-driven programming frameworks, such as Android, are based on components with asynchronous interfaces. The protocols for interacting with these components can often be described by finite-state machines we dub callback typestates. Callback typestates are akin to classical typestates, with the difference that their outputs (callbacks) are produced asynchronously. While useful, these specifications are not commonly available, because writing them is difficult and error-prone.

Our goal is to make the task of producing callback typestates significantly easier. We present a callback typestate assistant tool, DroidStar, that requires only limited user interaction to produce a callback typestate. Our approach is based on an active learning algorithm, L*. We improved the scalability of equivalence queries (a key component of L*), thus making active learning tractable on the Android system.

We use DroidStar to learn callback typestates for Android classes both for cases where one is already provided by the documentation, and for cases where the documentation is unclear. The results show that DROIDSTAR learns callback typestates accurately and efficiently. Moreover, in several cases, the synthesized callback typestates uncovered surprising and undocumented behaviors.

Debugging with intelligence via probabilistic inference

We aim to debug a single failing execution without the assistance from other passing/failing runs. In our context, debugging is a process with substantial uncertainty - lots of decisions have to be made such as what variables shall be inspected first. To deal with such uncertainty, we propose to equip machines with human-like intelligence. Specifically, we develop a highly automated debugging technique that aims to couple human-like reasoning (e.g., dealing with uncertainty and fusing knowledge) with program semantics based analysis, to achieve benefits from the two and mitigate their limitations. We model debugging as a probabilistic inference problem, in which the likelihood of each executed statement instance and variable being correct/faulty is modeled by a random variable. Human knowledge, human-like reasoning rules and program semantics are modeled as conditional probability distributions, also called probabilistic constraints. Solving these constraints identifies the most likely faulty statements. Our results show that the technique is highly effective. It can precisely identify root causes for a set of real-world bugs in a very small number of interactions with developers, much smaller than a recent proposal that does not encode human intelligence. Our user study also confirms that it substantially improves human productivity.

Reducer-based construction of conditional verifiers

Despite recent advances, software verification remains challenging. To solve hard verification tasks, we need to leverage not just one but several different verifiers employing different technologies. To this end, we need to exchange information between verifiers. Conditional model checking was proposed as a solution to exactly this problem: The idea is to let the first verifier output a condition which describes the state space that it successfully verified and to instruct the second verifier to verify the yet unverified state space using this condition. However, most verifiers do not understand conditions as input.

In this paper, we propose the usage of an off-the-shelf construction of a conditional verifier from a given traditional verifier and a reducer. The reducer takes as input the program to be verified and the condition, and outputs a residual program whose paths cover the unverified state space described by the condition. As a proof of concept, we designed and implemented one particular reducer and composed three conditional model checkers from the three best verifiers at SV-COMP 2017. We defined a set of claims and experimentally evaluated their validity. All experimental data and results are available for replication.

SESSION: Surveys and reviews

Challenges and pitfalls on surveying evidence in the software engineering technical literature: an exploratory study with novices

The evidence-based software engineering approach advocates the use of scientific evidence by software engineers to support the adoption of software technologies in industrial software development and maintenance projects. Aside from the unavailability of scientific knowledge in industrial settings and the time required to acquire evidence in the software engineering (SE) field, additional challenges prevent practitioners, mainly those that are not experienced in research, from collecting knowledge from scientific sources to support the decisionmaking throughout their software projects.

Statistical errors in software engineering experiments: a preliminary literature review

Background: Statistical concepts and techniques are often applied incorrectly, even in mature disciplines such as medicine or psychology. Surprisingly, there are very few works that study statistical problems in software engineering (SE). Aim: Assess the existence of statistical errors in SE experiments. Method: Compile the most common statistical errors in experimental disciplines. Survey experiments published in ICSE to assess whether errors occur in high quality SE publications. Results: The same errors as identified in others disciplines were found in ICSE experiments, where 30% of the reviewed papers included several error types such as: a) missing statistical hypotheses, b) missing sample size calculation, c) failure to assess statistical test assumptions, and d) uncorrected multiple testing. This rather large error rate is greater for research papers where experiments are confined to the validation section. The origin of the errors can be traced back to: a) researchers not having sufficient statistical training, and, b) a profusion of exploratory research. Conclusions: This paper provides preliminary evidence that SE research suffers from the same statistical problems as other experimental disciplines. However, the SE community appears to be unaware of any shortcomings in its experiments, whereas other disciplines work hard to avoid these threats. Further research is necessary to find the underlying causes and set up corrective measures, but there are some potentially effective actions and are a priori easy to implement: a) improve the statistical training of SE researchers, and b) enforce quality assessment and reporting guidelines in SE publications.

Synthesizing qualitative research in software engineering: a critical review

Synthesizing data extracted from primary studies is an integral component of the methodologies in support of Evidence Based Software Engineering (EBSE) such as System Literature Review (SLR). Since a large and increasing number of studies in Software Engineering (SE) incorporate qualitative data, it is important to systematically review and understand different aspects of the Qualitative Research Synthesis (QRS) being used in SE. We have reviewed the use of QRS methods in 328 SLRs published between 2005 and 2015. We also inquired the authors of 274 SLRs to confirm whether or not any QRS methods were used in their respective reviews. 116 of them provided the responses, which were included in our analysis. We found eight QRS methods applied in SE research, two of which, narrative synthesis and thematic synthesis, have been predominantly adopted by SE researchers for synthesizing qualitative data. Our study determines that a significant amount of missing knowledge and incomplete understanding of the defined QRS methods in the community. Our effort also identifies an initial set factors that may influence the selection and use of appropriate QRS methods in SE.

SESSION: Search-based software engineering II

Automatic software repair: a survey

Debugging software failures is still a painful, time consuming, and expensive process. For instance, recent studies showed that debugging activities often account for about 50% of the overall development cost of software products [3]. There are many factors contributing to the cost of debugging, but the most impacting one is the extensive manual effort that is still required to identify and remove faults. So far, the automation of debugging activities essentially resulted in the development of techniques that provide useful insights about the possible locations of faults, the inputs and states of the application responsible for the failures, as well as the anomalous operations executed during failures. However, developers must still put a relevant effort on the analysis of the failed executions to exactly identify the faults that must be fixed. In addition, these techniques do not help the developers with the synthesis of an appropriate fix.

Search-based test data generation for SQL queries

Database-centric systems strongly rely on SQL queries to manage and manipulate their data. These SQL commands can range from very simple selections to queries that involve several tables, sub-queries, and grouping operations. And, as with any important piece of code, developers should properly test SQL queries. In order to completely test a SQL query, developers need to create test data that exercise all possible coverage targets in a query, e.g., JOINs and WHERE predicates. And indeed, this task can be challenging and time-consuming for complex queries. Previous studies have modeled the problem of generating test data as a constraint satisfaction problem and, with the help of SAT solvers, generate the required data. However, such approaches have strong limitations, such as partial support for queries with JOINs, subqueries, and strings (which are commonly used in SQL queries). In this paper, we model test data generation for SQL queries as a search-based problem. Then, we devise and evaluate three different approaches based on random search, biased random search, and genetic algorithms (GAs). The GA, in particular, uses a fitness function based on information extracted from the physical query plan of a database engine as search guidance. We then evaluate each approach in 2,135 queries extracted from three open source software and one industrial software system. Our results show that GA is able to completely cover 98.6% of all queries in the dataset, requiring only a few seconds per query. Moreover, it does not suffer from the limitations affecting state-of-the art techniques.

Multi-objective integer programming approaches for solving optimal feature selection problem: a new perspective on multi-objective optimization problems in SBSE

The optimal feature selection problem in software product line is typically addressed by the approaches based on Indicator-based Evolutionary Algorithm (IBEA). In this study we first expose the mathematical nature of this problem --- multi-objective binary integer linear programming. Then, we implement/propose three mathematical programming approaches to solve this problem at different scales. For small-scale problems (roughly less than 100 features), we implement two established approaches to find all exact solutions. For medium-to-large problems (roughly, more than 100 features), we propose one efficient approach that can generate a representation of the entire Pareto front in linear time complexity. The empirical results show that our proposed method can find significantly more non-dominated solutions in similar or less execution time, in comparison with IBEA and its recent enhancement (i.e., IBED that combines IBEA and Differential Evolution).

Automated refactoring of OCL constraints with search

Object Constraint Language (OCL) constraints are typically used for providing precise semantics to models developed with the Unified Modeling Language (UML). When OCL constraints evolve in a regular basis, it is essential that they are easy to understand and maintain. For instance, in cancer registries, to ensure the quality of cancer data, more than one thousand medical rules are defined and evolve regularly. Such rules can be specified with OCL. It is, therefore, important to ensure the understandability and maintainability of medical rules specified with OCL.

To tackle such a challenge, we propose an automated search-<u>b</u>ased <u>O</u>CL constraint <u>r</u>efactoring <u>a</u>pproach (SBORA) by defining and applying three OCL quality metrics (Complexity, Coupling, and Cohesion) and four semantics-preserving refactoring operators (i.e., Context Change, Swap, Split and Merge) which are encoded as potential solutions for search algorithms. A solution is therefore an optimal sequence of refactoring operators, which are sequentially applied to the original set of OCL constraints to automatically obtain a semantically equivalent set of OCL constraints with better understandability and maintainability in terms of Complexity, Coupling, and Cohesion.

We evaluate SBORA along with six commonly used multi-objective search algorithms (e.g., Indicator-Based Evolutionary Algorithm (IBEA)) by employing four case studies from different domains: healthcare (i.e., cancer registry system from Cancer Registry of Norway (CRN)), Oil&Gas (i.e., subsea production systems), warehouse (i.e., handling systems), and an open source case study named SEPA. Results show: 1) IBEA achieves the best performance among all the search algorithms and 2) the refactoring approach along with IBEA can manage to reduce on average 29.25% Complexity and 39% Coupling and improve 47.75% Cohesion, as compared to the original OCL constraint set from CRN. To further test the performance of SBORA, we also applied it to refactor an OCL constraint set specified on the UML 2.3 metamodel and we obtained encouraging results. Furthermore, we conducted a controlled experiment with 96 subjects and results show that the understandability and maintainability of the original constraint set can be improved significantly from the perspectives of the 96 participants of the controlled experiment.

Automatically generating search heuristics for concolic testing

We present a technique to automatically generate search heuristics for concolic testing. A key challenge in concolic testing is how to effectively explore the program's execution paths to achieve high code coverage in a limited time budget. Concolic testing employs a search heuristic to address this challenge, which favors exploring particular types of paths that are most likely to maximize the final coverage. However, manually designing a good search heuristic is nontrivial and typically ends up with suboptimal and unstable outcomes. The goal of this paper is to overcome this shortcoming of concolic testing by automatically generating search heuristics. We define a class of search heuristics, namely a parameterized heuristic, and present an algorithm that efficiently finds an optimal heuristic for each subject program. Experimental results with open-source C programs show that our technique successfully generates search heuristics that significantly outperform existing manually-crafted heuristics in terms of branch coverage and bug-finding.