ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering

Full Citation in the ACM Digital Library

SESSION: Invited Papers

The rising tide lifts all boats: the advancement of science in cyber security (invited talk)

Stolen passwords, compromised medical records, taking the internet out through video cameras– cybersecurity breaches are in the news every day. Despite all this, the practice of cybersecurity today is generally reactive rather than proactive. That is, rather than improving their defenses in advance, organizations react to attacks once they have occurred by patching the individual vulnerabilities that led to those attacks. Researchers engineer solutions to the latest form of attack. What we need, instead, are scientifically founded design principles for building in security mechanisms from the beginning, giving protection against broad classes of attacks. Through scientific measurement, we can improve our ability to make decisions that are evidence-based, proactive, and long-sighted. Recognizing these needs, the US National Security Agency (NSA) devised a new framework for collaborative research, the “Lablet” structure, with the intent to more aggressively advance the science of cybersecurity. A key motivation was to catalyze a shift in relevant areas towards a more organized and cohesive scientific community. The NSA named Carnegie Mellon University, North Carolina State University, and the University of Illinois – Urbana Champaign its initial Lablets in 2011, and added the University of Maryland in 2014.

This talk will reflect on the structure of the collaborative research efforts of the Lablets, lessons learned in the transition to more scientific concepts to cybersecurity, research results in solving five hard security problems, and methods that are being used for the measurement of scientific progress of the Lablet research.

Verifying the forecast: how climate models are developed and tested (invited talk)

Stolen passwords, compromised medical records, taking the internet out through video cameras– cybersecurity breaches are in the news every day. Despite all this, the practice of cybersecurity today is generally reactive rather than proactive. That is, rather than improving their defenses in advance, organizations react to attacks once they have occurred by patching the individual vulnerabilities that led to those attacks. Researchers engineer solutions to the latest form of attack. What we need, instead, are scientifically founded design principles for building in security mechanisms from the beginning, giving protection against broad classes of attacks. Through scientific measurement, we can improve our ability to make decisions that are evidence-based, proactive, and long-sighted. Recognizing these needs, the US National Security Agency (NSA) devised a new framework for collaborative research, the “Lablet” structure, with the intent to more aggressively advance the science of cybersecurity. A key motivation was to catalyze a shift in relevant areas towards a more organized and cohesive scientific community. The NSA named Carnegie Mellon University, North Carolina State University, and the University of Illinois – Urbana Champaign its initial Lablets in 2011, and added the University of Maryland in 2014.

This talk will reflect on the structure of the collaborative research efforts of the Lablets, lessons learned in the transition to more scientific concepts to cybersecurity, research results in solving five hard security problems, and methods that are being used for the measurement of scientific progress of the Lablet research.

Software engineering research results in industrial practice: a tale of two projects (invited talk)

In this talk, I will discuss the use of software engineering research results in industrial practice, based on two projects I have been involved with. The first project addressed the challenge that manipulation of financial market data had to be expressed precisely for a large number of different financial markets. The challenge was addressed by defining a functional Domain Specific Language (DSL) that was geared towards expressing these manipulations at a high level of abstraction. An environment that implements the DSL was built using the Eclipse platform together with a compiler that generates a Java-based reference implementation of these manipulations. The implementation is used as a test oracle to generate test cases, which are in turn used to validate a soft real-time system that implements these manipulations. In another project that is still ongoing, I have proposed the use of software product line research to engineer a family of mobile banking applications. I will reflect on the experience of integrating software product line principles and modern Agile development practices. I will then discuss a few areas of software engineering research, that I have personally been involved in, that I have found not to be very useful in practice. I will conclude by outlining some topics where novel research results would be very beneficial from an industrial point of view.

Reflections on the REST architectural style and "principled design of the modern web architecture" (impact paper award)

Seventeen years after its initial publication at ICSE 2000, the Representational State Transfer (REST) architectural style continues to hold significance as both a guide for understanding how the World Wide Web is designed to work and an example of how principled design, through the application of architectural styles, can impact the development and understanding of large-scale software architecture. However, REST has also become an industry buzzword: frequently abused to suit a particular argument, confused with the general notion of using HTTP, and denigrated for not being more like a programming methodology or implementation framework.

In this paper, we chart the history, evolution, and shortcomings of REST, as well as several related architectural styles that it inspired, from the perspective of a chain of doctoral dissertations produced by the University of California's Institute for Software Research at UC Irvine. These successive theses share a common theme: extending the insights of REST to new domains and, in their own way, exploring the boundary of software engineering as it applies to decentralized software architectures and architectural design. We conclude with discussion of the circumstances, environment, and organizational characteristics that gave rise to this body of work.

SESSION: Research Papers

A fast causal profiler for task parallel programs

This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF’s use of hardware performance counters to perform fine-grained measurements minimizes perturbation. TASKPROF’s profile execution runs in parallel using multi-cores. TASKPROF’s causal profile enables users to estimate improvements in parallelism when a region of code is optimized even when concrete optimizations are not yet known. We have used TASKPROF to isolate parallelism bottlenecks in twenty three applications that use the Intel Threading Building Blocks library. We have designed parallelization techniques in five applications to increase parallelism by an order of magnitude using TASKPROF. Our user study indicates that developers are able to isolate performance bottlenecks with ease using TASKPROF.

On the scalability of Linux kernel maintainers' work

Open source software ecosystems evolve ways to balance the workload among groups of participants ranging from core groups to peripheral groups. As ecosystems grow, it is not clear whether the mechanisms that previously made them work will continue to be relevant or whether new mechanisms will need to evolve. The impact of failure for critical ecosystems such as Linux is enormous, yet the understanding of why they function and are effective is limited. We, therefore, aim to understand how the Linux kernel sustains its growth, how to characterize the workload of maintainers, and whether or not the existing mechanisms are scalable. We quantify maintainers' work through the files that are maintained, and the change activity and the numbers of contributors in those files. We find systematic differences among modules; these differences are stable over time, which suggests that certain architectural features, commercial interests, or module-specific practices lead to distinct sustainable equilibria. We find that most of the modules have not grown appreciably over the last decade; most growth has been absorbed by a few modules. We also find that the effort per maintainer does not increase, even though the community has hypothesized that required effort might increase. However, the distribution of work among maintainers is highly unbalanced, suggesting that a few maintainers may experience increasing workload. We find that the practice of assigning multiple maintainers to a file yields only a power of 1/2 increase in productivity. We expect that our proposed framework to quantify maintainer practices will help clarify the factors that allow rapidly growing ecosystems to be sustainable.

Modeling and verification of evolving cyber-physical spaces

We increasingly live in cyber-physical spaces -- spaces that are both physical and digital, and where the two aspects are intertwined. Such spaces are highly dynamic and typically undergo continuous change. Software engineering can have a profound impact in this domain, by defining suitable modeling and specification notations as well as supporting design-time formal verification. In this paper, we present a methodology and a technical framework which support modeling of evolving cyber-physical spaces and reasoning about their spatio-temporal properties. We utilize a discrete, graph-based formalism for modeling cyber-physical spaces as well as primitives of change, giving rise to a reactive system consisting of rewriting rules with both local and global application conditions. Formal reasoning facilities are implemented adopting logic-based specification of properties and according model checking procedures, in both spatial and temporal fragments. We evaluate our approach using a case study of a disaster scenario in a smart city.

Easy over hard: a case study on deep learning

While deep learning is an exciting new technique, the benefits of this method need to be assessed with respect to its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train the model. Such long training time limits the ability of (a)~a researcher to test the stability of their conclusion via repeated runs with different random seeds; and (b)~other researchers to repeat, improve, or even refute that original work.

For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That deep learning system took 14 hours to execute. We show here that applying a very simple optimizer called DE to fine tune SVM, it can achieve similar (and sometimes better) results. The DE approach terminated in 10 minutes; i.e. 84 times faster hours than deep learning method.

We offer these results as a cautionary tale to the software analytics community and suggest that not every new innovation should be applied without critical analysis. If researchers deploy some new and expensive process, that work should be baselined against some simpler and faster alternatives.

Finding near-optimal configurations in product lines by random sampling

Software Product Lines (SPLs) are highly configurable systems. This raises the challenge to find optimal performing configurations for an anticipated workload. As SPL configuration spaces are huge, it is infeasible to benchmark all configurations to find an optimal one. Prior work focused on building performance models to predict and optimize SPL configurations. Instead, we randomly sample and recursively search a configuration space directly to find near-optimal configurations without constructing a prediction model. Our algorithms are simpler and have higher accuracy and efficiency.

Revisiting unsupervised learning for defect prediction

Collecting quality data from software projects can be time-consuming and expensive. Hence, some researchers explore "unsupervised" approaches to quality prediction that does not require labelled data. An alternate technique is to use "supervised" approaches that learn models from project data labelled with, say, "defective" or "not-defective". Most researchers use these supervised models since, it is argued, they can exploit more knowledge of the projects.

At FSE-16, Yang et al. reported startling results where unsupervised defect predictors outperformed supervised predictors for effort-aware just-in-time defect prediction. If confirmed, these results would lead to a dramatic simplification of a seemingly complex task (data mining) that is widely explored in the software engineering literature.

This paper repeats and refutes those results as follows. (1) There is much variability in the efficacy of the Yang et al. predictors so even with their approach, some supervised data is required to prune weaker predictors away. (2) Their findings were grouped across N projects. When we repeat their analysis on a project-by-project basis, supervised predictors are seen to work better.

Even though this paper rejects the specific conclusions of Yang et al., we still endorse their general goal. In our our experiments, supervised predictors did not perform outstandingly better than unsupervised ones for effort-aware just-in-time defect prediction. Hence, they may indeed be some combination of unsupervised learners to achieve comparable performance to supervised ones. We therefore encourage others to work in this promising area.

Loopster: static loop termination analysis

Loop termination is an important problem for proving the correctness of a system and ensuring that the system always reacts. Existing loop termination analysis techniques mainly depend on the synthesis of ranking functions, which is often expensive. In this paper, we present a novel approach, named Loopster, which performs an efficient static analysis to decide the termination for loops based on path termination analysis and path dependency reasoning. Loopster adopts a divide-and-conquer approach: (1) we extract individual paths from a target multi-path loop and analyze the termination of each path, (2) analyze the dependencies between each two paths, and then (3) determine the overall termination of the target loop based on the relations among paths. We evaluate Loopster by applying it on the loop termination competition benchmark and three real-world projects. The results show that Loopster is effective in a majority of loops with better accuracy and 20 ×+ performance improvement compared to the state-of-the-art tools.


We present CodeCarbonCopy (CCC), a system for transferring code from a donor application into a recipient application. CCC starts with functionality identified by the developer to transfer into an insertion point (again identified by the developer) in the recipient. CCC uses paired executions of the donor and recipient on the same input file to obtain a translation between the data representation and name space of the recipient and the data representation and name space of the donor. It also implements a static analysis that identifies and removes irrelevant functionality useful in the donor but not in the recipient. We evaluate CCC on eight transfers between six applications. Our results show that CCC can successfully transfer donor functionality into recipient applications.

The power of "why" and "why not": enriching scenario exploration with provenance

Scenario-finding tools like the Alloy Analyzer are widely used in numerous concrete domains like security, network analysis, UML analysis, and so on. They can help to verify properties and, more generally, aid in exploring a system's behavior.

While scenario finders are valuable for their ability to produce concrete examples, individual scenarios only give insight into what is possible, leaving the user to make their own conclusions about what might be necessary. This paper enriches scenario finding by allowing users to ask ``why?'' and ``why not?'' questions about the examples they are given. We show how to distinguish parts of an example that cannot be consistently removed (or changed) from those that merely reflect underconstraint in the specification. In the former case we show how to determine which elements of the specification and which other components of the example together explain the presence of such facts.

This paper formalizes the act of computing provenance in scenario-finding. We present Amalgam, an extension of the popular Alloy scenario-finder, which implements these foundations and provides interactive exploration of examples. We also evaluate Amalgam's algorithmics on a variety of both textbook and real-world examples.

Where is the bug and how is it fixed? an experiment with practitioners

Research has produced many approaches to automatically locate, explain, and repair software bugs. But do these approaches relate to the way practitioners actually locate, understand, and fix bugs? To help answer this question, we have collected a dataset named DBGBENCH --- the correct fault locations, bug diagnoses, and software patches of 27 real errors in open-source C projects that were consolidated from hundreds of debugging sessions of professional software engineers. Moreover, we shed light on the entire debugging process, from constructing a hypothesis to submitting a patch, and how debugging time, difficulty, and strategies vary across practitioners and types of errors. Most notably, DBGBENCH can serve as reality check for novel automated debugging and repair techniques.

Understanding misunderstandings in source code

Humans often mistake the meaning of source code, and so misjudge a program's true behavior. These mistakes can be caused by extremely small, isolated patterns in code, which can lead to significant runtime errors. These patterns are used in large, popular software projects and even recommended in style guides. To identify code patterns that may confuse programmers we extracted a preliminary set of `atoms of confusion' from known confusing code. We show empirically in an experiment with 73 participants that these code patterns can lead to a significantly increased rate of misunderstanding versus equivalent code without the patterns. We then go on to take larger confusing programs and measure (in an experiment with 43 participants) the impact, in terms of programmer confusion, of removing these confusing patterns. All of our instruments, analysis code, and data are publicly available online for replication, experimentation, and feedback.

Measuring neural efficiency of program comprehension

Most modern software programs cannot be understood in their entirety by a single programmer. Instead, programmers must rely on a set of cognitive processes that aid in seeking, filtering, and shaping relevant information for a given programming task. Several theories have been proposed to explain these processes, such as ``beacons,' for locating relevant code, and ``plans,'' for encoding cognitive models. However, these theories are decades old and lack validation with modern cognitive-neuroscience methods. In this paper, we report on a study using functional magnetic resonance imaging (fMRI) with 11 participants who performed program comprehension tasks. We manipulated experimental conditions related to beacons and layout to isolate specific cognitive processes related to bottom-up comprehension and comprehension based on semantic cues. We found evidence of semantic chunking during bottom-up comprehension and lower activation of brain areas during comprehension based on semantic cues, confirming that beacons ease comprehension.

Bayesian specification learning for finding API usage errors

We present a Bayesian framework for learning probabilistic specifications from large, unstructured code corpora, and then using these specifications to statically detect anomalous, hence likely buggy, program behavior. Our key insight is to build a statistical model that correlates all specifications hidden inside a corpus with the syntax and observed behavior of programs that implement these specifications. During the analysis of a particular program, this model is conditioned into a posterior distribution that prioritizes specifications that are relevant to the program. The problem of finding anomalies is now framed quantitatively, as a problem of computing a distance between a "reference distribution" over program behaviors that our model expects from the program, and the distribution over behaviors that the program actually produces.

We implement our ideas in a system, called Salento, for finding anomalous API usage in Android programs. Salento learns specifications using a combination of a topic model and a neural network model. Our encouraging experimental results show that the system can automatically discover subtle errors in Android applications in the wild, and has high precision and recall compared to competing probabilistic approaches.

Synergistic debug-repair of heap manipulations

We present Wolverine, an integrated Debug-Repair environment for heap manipulating programs. Wolverine facilitates stepping through a concrete program execution, provides visualizations of the abstract program states (as box-and-arrow diagrams) and integrates a novel, proof-directed repair algorithm to synthesize repair patches. To provide a seamless environment, Wolverine supports "hot-patching" of the generated repair patches, enabling the programmer to continue the debug session without requiring an abort-compile-debug cycle. We also propose new debug-repair possibilities, "specification refinement" and "specification slicing" made possible by Wolverine. We evaluate our framework on 1600 buggy programs (generated using fault injection) on a variety of data-structures like singly, doubly and circular linked-lists, Binary Search Trees, AVL trees, Red-Black trees and Splay trees; Wolverine could repair all the buggy instances within reasonable time (less than 5 sec in most cases). We also evaluate Wolverine on 247 (buggy) student submissions; Wolverine could repair more than 80% of programs where the student had made a reasonable attempt.

Failure-directed program trimming

This paper describes a new program simplification technique called program trimming that aims to improve the scalability and precision of safety checking tools. Given a program P, program trimming generates a new program P' such that P and P' are equi-safe (i.e., P' has a bug if and only if P has a bug), but P' has fewer execution paths than P. Since many program analyzers are sensitive to the number of execution paths, program trimming has the potential to improve the effectiveness of safety checking tools.

In addition to introducing the concept of program trimming, this paper also presents a lightweight static analysis that can be used as a pre-processing step to remove program paths while retaining equi-safety. We have implemented the proposed technique in a tool called Trimmer and evaluate it in the context of two program analysis techniques, namely abstract interpretation and dynamic symbolic execution. Our experiments show that program trimming significantly improves the effectiveness of both techniques.

Why modern open source projects fail

Open source is experiencing a renaissance period, due to the appearance of modern platforms and workflows for developing and maintaining public code. As a result, developers are creating open source software at speeds never seen before. Consequently, these projects are also facing unprecedented mortality rates. To better understand the reasons for the failure of modern open source projects, this paper describes the results of a survey with the maintainers of 104 popular GitHub systems that have been deprecated. We provide a set of nine reasons for the failure of these open source projects. We also show that some maintenance practices---specifically the adoption of contributing guidelines and continuous integration---have an important association with a project failure or success. Finally, we discuss and reveal the principal strategies developers have tried to overcome the failure of the studied projects.

Trade-offs in continuous integration: assurance, security, and flexibility

Continuous integration (CI) systems automate the compilation, building, and testing of software. Despite CI being a widely used activity in software engineering, we do not know what motivates developers to use CI, and what barriers and unmet needs they face. Without such knowledge, developers make easily avoidable errors, tool builders invest in the wrong direction, and researchers miss opportunities for improving the practice of CI. We present a qualitative study of the barriers and needs developers face when using CI. We conduct semi-structured interviews with developers from different industries and development scales. We triangulate our findings by running two surveys. We find that developers face trade-offs between speed and certainty (Assurance), between better access and information security (Security), and between more configuration options and greater ease of use (Flexi- bility). We present implications of these trade-offs for developers, tool builders, and researchers.

µDroid: an energy-aware mutation testing framework for Android

The rising popularity of mobile apps deployed on battery-constrained devices underlines the need for effectively evaluating their energy properties. However, currently there is a lack of testing tools for evaluating the energy properties of apps. As a result, for energy testing, developers are relying on tests intended for evaluating the functional correctness of apps. Such tests may not be adequate for revealing energy defects and inefficiencies in apps. This paper presents an energy-aware mutation testing framework, called μDROID, that can be used by developers to assess the adequacy of their test suite for revealing energy-related defects. μDROID implements fifty energy-aware mutation operators and relies on a novel, automatic oracle to determine if a mutant can be killed by a test. Our evaluation on real-world Android apps shows the ability of proposed mutation operators for evaluating the utility of tests in revealing energy defects. Moreover, our automated oracle can detect whether tests kill the energy mutants with an overall accuracy of 94%, thereby making it possible to apply μDROID automatically.

PATDroid: permission-aware GUI testing of Android

Recent introduction of a dynamic permission system in Android, allowing the users to grant and revoke permissions after the installation of an app, has made it harder to properly test apps. Since an app's behavior may change depending on the granted permissions, it needs to be tested under a wide range of permission combinations. At the state-of-the-art, in the absence of any automated tool support, a developer needs to either manually determine the interaction of tests and app permissions, or exhaustively re-execute tests for all possible permission combinations, thereby increasing the time and resources required to test apps. This paper presents an automated approach, called PATDroid, for efficiently testing an Android app while taking the impact of permissions on its behavior into account. PATDroid performs a hybrid program analysis on both an app under test and its test suite to determine which tests should be executed on what permission combinations. Our experimental results show that PATDroid significantly reduces the testing effort, yet achieves comparable code coverage and fault detection capability as exhaustively testing an app under all permission combinations.

Enabling mutation testing for Android apps

Mutation testing has been widely used to assess the fault-detection effectiveness of a test suite, as well as to guide test case generation or prioritization. Empirical studies have shown that, while mutants are generally representative of real faults, an effective application of mutation testing requires “traditional” operators designed for programming languages to be augmented with operators specific to an application domain and/or technology. This paper proposes MDroid+, a framework for effective mutation testing of Android apps. First, we systematically devise a taxonomy of 262 types of Android faults grouped in 14 categories by manually analyzing 2,023 so ware artifacts from different sources (e.g., bug reports, commits). Then, we identified a set of 38 mutation operators, and implemented an infrastructure to automatically seed mutations in Android apps with 35 of the identified operators. The taxonomy and the proposed operators have been evaluated in terms of stillborn/trivial mutants generated as compared to well know mutation tools, and their capacity to represent real faults in Android apps

Guided, stochastic model-based GUI testing of Android apps

Mobile apps are ubiquitous, operate in complex environments and are developed under the time-to-market pressure. Ensuring their correctness and reliability thus becomes an important challenge. This paper introduces Stoat, a novel guided approach to perform stochastic model-based testing on Android apps. Stoat operates in two phases: (1) Given an app as input, it uses dynamic analysis enhanced by a weighted UI exploration strategy and static analysis to reverse engineer a stochastic model of the app's GUI interactions; and (2) it adapts Gibbs sampling to iteratively mutate/refine the stochastic model and guides test generation from the mutated models toward achieving high code and model coverage and exhibiting diverse sequences. During testing, system-level events are randomly injected to further enhance the testing effectiveness.

Stoat was evaluated on 93 open-source apps. The results show (1) the models produced by Stoat cover 17~31% more code than those by existing modeling tools; (2) Stoat detects 3X more unique crashes than two state-of-the-art testing tools, Monkey and Sapienz. Furthermore, Stoat tested 1661 most popular Google Play apps, and detected 2110 previously unknown and unique crashes. So far, 43 developers have responded that they are investigating our reports. 20 of reported crashes have been confirmed, and 8 already fixed.

Using bad learners to find good configurations

Finding the optimally performing configuration of a software system for a given setting is often challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, building an accurate performance model can be very expensive (and is often infeasible in practice). The central insight of this paper is that exact performance values (e.g., the response time of a software system) are not required to rank configurations and to identify the optimal one. As shown by our experiments, performance models that are cheap to learn but inaccurate (with respect to the difference between actual and predicted performance) can still be used rank configurations and hence find the optimal configuration. This novel rank-based approach allows us to significantly reduce the cost (in terms of number of measurements of sample configuration) as well as the time required to build performance models. We evaluate our approach with 21 scenarios based on 9 software systems and demonstrate that our approach is beneficial in 16 scenarios; for the remaining 5 scenarios, an accurate model can be built by using very few samples anyway, without the need for a rank-based approach.

Attributed variability models: outside the comfort zone

Variability models are often enriched with attributes, such as performance, that encode the influence of features on the respective attribute. In spite of their importance, there are only few attributed variability models available that have attribute values obtained from empirical, real-world observations and that cover interactions between features. But, what does it mean for research and practice when staying in the comfort zone of developing algorithms and tools in a setting where artificial attribute values are used and where interactions are neglected? This is the central question that we want to answer here. To leave the comfort zone, we use a combination of kernel density estimation and a genetic algorithm to rescale a given (real-world) attribute-value profile to a given variability model. To demonstrate the influence and relevance of realistic attribute values and interactions, we present a replication of a widely recognized, third-party study, into which we introduce realistic attribute values and interactions. We found statistically significant differences between the original study and the replication. We infer lessons learned to conduct experiments that involve attributed variability models. We also provide the accompanying tool Thor for generating attribute values including interactions. Our solution is shown to be agnostic about the given input distribution and to scale to large variability models.

Kmax: finding all configurations of Kbuild makefiles statically

Feature-oriented software design is a useful paradigm for building and reasoning about highly-configurable software. By making variability explicit, feature-oriented tools and languages make program analysis tasks easier, such as bug-finding, maintenance, and more. But critical software, such as Linux, coreboot, and BusyBox rely instead on brittle tools, such as Makefiles, to encode variability, impeding variability-aware tool development. Summarizing Makefile behavior for all configurations is difficult, because Makefiles have unusual semantics, and exhaustive enumeration of all configurations is intractable in practice. Existing approaches use ad-hoc heuristics, missing much of the encoded variability in Makefiles. We present Kmax, a new static analysis algorithm and tool for Kbuild Makefiles. It is a family-based variability analysis algorithm, where paths are Boolean expressions of configuration options, called reaching configurations, and its abstract state enumerates string values for all configurations. Kmax localizes configuration explosion to the statement level, making precise analysis tractable. The implementation analyzes Makefiles from the Kbuild build system used by several low-level systems projects. Evaluation of Kmax on the Linux and BusyBox build systems shows it to be accurate, precise, and fast. It is the first tool to collect all source files and their configurations from Linux. Compared to previous approaches, Kmax is far more accurate and precise, performs with little overhead, and scales better.

Is there a mismatch between real-world feature models and product-line research?

Feature modeling has emerged as the de-facto standard to compactly capture the variability of a software product line. Multiple feature modeling languages have been proposed that evolved over the last decades to manage industrial-size product lines. However, less expressive languages, solely permitting require and exclude constraints, are permanently and carelessly used in product-line research. We address the problem whether those less expressive languages are sufficient for industrial product lines. We developed an algorithm to eliminate complex cross-tree constraints in a feature model, enabling the combination of tools and algorithms working with different feature model dialects in a plug-and-play manner. However, the scope of our algorithm is limited. Our evaluation on large feature models, including the Linux kernel, gives evidence that require and exclude constraints are not sufficient to express real-world feature models. Hence, we promote that research on feature models needs to consider arbitrary propositional formulas as cross-tree constraints prospectively.

Adaptively generating high quality fixes for atomicity violations

It is difficult to fix atomicity violations correctly. Existing gate lock algorithm (GLA) simply inserts gate locks to serialize exe-cutions, which may introduce performance bugs and deadlocks. Synthesized context-aware gate locks (by Grail) require complex source code synthesis. We propose Fixer to adaptively fix ato-micity violations. It firstly analyses the lock acquisitions of an atomicity violation. Then it either adjusts the existing lock scope or inserts a gate lock. The former addresses cases where some locks are used but fail to provide atomic accesses. For the latter, it infers the visibility (being global or a field of a class/struct) of the gate lock such that the lock only protects related accesses. For both cases, Fixer further eliminates new lock orders to avoid introducing deadlocks. Of course, Fixer can produce both kinds of fixes on atomicity violations with locks. The experi-mental results on 15 previously used atomicity violations show that: Fixer correctly fixed all 15 atomicity violations without introducing deadlocks. However, GLA and Grail both intro-duced 5 deadlocks. HFix (that only targets on fixing certain types of atomicity violations) only fixed 2 atomicity violations and introduced 4 deadlocks. Fixer also provides an alternative way to insert gate locks (by inserting gate locks with proper visibility) considering fix acceptance.

AtexRace: across thread and execution sampling for in-house race detection

Data race is a major source of concurrency bugs. Dynamic data race detection tools (e.g., FastTrack) monitor the execu-tions of a program to report data races occurring in runtime. However, such tools incur significant overhead that slows down and perturbs executions. To address the issue, the state-of-the-art dynamic data race detection tools (e.g., LiteRace) ap-ply sampling techniques to selectively monitor memory access-es. Although they reduce overhead, they also miss many data races as confirmed by existing studies. Thus, practitioners face a dilemma on whether to use FastTrack, which detects more data races but is much slower, or LiteRace, which is faster but detects less data races. In this paper, we propose a new sam-pling approach to address the major limitations of current sampling techniques, which ignore the facts that a data race involves two threads and a program under testing is repeatedly executed. We develop a tool called AtexRace to sample memory accesses across both threads and executions. By selectively monitoring the pairs of memory accesses that have not been frequently observed in current and previous executions, AtexRace detects as many data races as FastTrack at a cost as low as LiteRace. We have compared AtexRace against FastTrack and LiteRace on both Parsec benchmark suite and a large-scale real-world MySQL Server with 223 test cases. The experiments confirm that AtexRace can be a replacement of FastTrack and LiteRace.

Symbolic execution of programmable logic controller code

Programmable logic controllers (PLCs) are specialized computers for automating a wide range of cyber-physical systems. Since these systems are often safety-critical, software running on PLCs need to be free of programming errors. However, automated tools for testing PLC software are lacking despite the pervasive use of PLCs in industry. We propose a symbolic execution based method, named SymPLC, for automatically testing PLC software written in programming languages specified in the IEC 61131-3 standard. SymPLC takes the PLC source code as input and translates it into C before applying symbolic execution, to systematically generate test inputs that cover both paths in each periodic task and interleavings of these tasks. Toward this end, we propose a number of PLC-specific reduction techniques for identifying and eliminating redundant interleavings. We have evaluated SymPLC on a large set of benchmark programs with both single and multiple tasks. Our experiments show that SymPLC can handle these programs efficiently, and for multi-task PLC programs, our new reduction techniques outperform the state-of-the-art partial order reduction technique by more than two orders of magnitude.

Thread-modular static analysis for relaxed memory models

We propose a memory-model-aware static program analysis method for accurately analyzing the behavior of concurrent software running on processors with weak consistency models such as x86-TSO, SPARC-PSO, and SPARC-RMO. At the center of our method is a unified framework for deciding the feasibility of inter-thread interferences to avoid propagating spurious data flows during static analysis and thus boost the performance of the static analyzer. We formulate the checking of interference feasibility as a set of Datalog rules which are both efficiently solvable and general enough to capture a range of hardware-level memory models. Compared to existing techniques, our method can significantly reduce the number of bogus alarms as well as unsound proofs. We implemented the method and evaluated it on a large set of multithreaded C programs. Our experiments show the method significantly outperforms state-of-the-art techniques in terms of accuracy with only moderate runtime overhead.

ARTINALI: dynamic invariant detection for cyber-physical system security

Cyber-Physical Systems (CPSes) are being widely deployed in security critical scenarios such as smart homes and medical devices. Unfortunately, the connectedness of these systems and their relative lack of security measures makes them ripe targets for attacks. Specification-based Intrusion Detection Systems (IDS) have been shown to be effective for securing CPSs. Unfortunately, deriving invariants for capturing the specifications of CPS systems is a tedious and error-prone process. Therefore, it is important to dynamically monitor the CPS system to learn its common behaviors and formulate invariants for detecting security attacks. Existing techniques for invariant mining only incorporate data and events, but not time. However, time is central to most CPS systems, and hence incorporating time in addition to data and events, is essential for achieving low false positives and false negatives. This paper proposes ARTINALI, which mines dynamic system properties by incorporating time as a first-class property of the system. We build ARTINALI-based Intrusion Detection Systems (IDSes) for two CPSes, namely smart meters and smart medical devices, and measure their efficacy. We find that the ARTINALI-based IDSes significantly reduce the ratio of false positives and false negatives by 16 to 48% (average 30.75%) and 89 to 95% (average 93.4%) respectively over other dynamic invariant detection tools.

A symbolic justice violations transition system for unrealizable GR(1) specifications

One of the main challenges of reactive synthesis, an automated procedure to obtain a correct-by-construction reactive system, is to deal with unrealizable specifications. Existing approaches to deal with unrealizability, in the context of GR(1), an expressive assume-guarantee fragment of LTL that enables efficient synthesis, include the generation of concrete counter-strategies and the computation of an unrealizable core. Although correct, such approaches produce large and complicated counter-strategies, often containing thousands of states. This hinders their use by engineers.

In this work we present the Justice Violations Transition System (JVTS), a novel symbolic representation of counter-strategies for GR(1). The JVTS is much smaller and simpler than its corresponding concrete counter-strategy. Moreover, it is annotated with invariants that explain how the counter-strategy forces the system to violate the specification. We compute the JVTS symbolically, and thus more efficiently, without the expensive enumeration of concrete states. Finally, we provide the JVTS with an on-demand interactive concrete and symbolic play.

We implemented our work, validated its correctness, and evaluated it on 14 unrealizable specifications of autonomous Lego robots as well as on benchmarks from the literature. The evaluation shows not only that the JVTS is in most cases much smaller than the corresponding concrete counter-strategy, but also that its computation is faster.

Automated control of multiple software goals using multiple actuators

Modern software should satisfy multiple goals simultaneously: it should provide predictable performance, be robust to failures, handle peak loads and deal seamlessly with unexpected conditions and changes in the execution environment. For this to happen, software designs should account for the possibility of runtime changes and provide formal guarantees of the software's behavior. Control theory is one of the possible design drivers for runtime adaptation, but adopting control theoretic principles often requires additional, specialized knowledge. To overcome this limitation, automated methodologies have been proposed to extract the necessary information from experimental data and design a control system for runtime adaptation. These proposals, however, only process one goal at a time, creating a chain of controllers. In this paper, we propose and evaluate the first automated strategy that takes into account multiple goals without separating them into multiple control strategies. Avoiding the separation allows us to tackle a larger class of problems and provide stronger guarantees. We test our methodology's generality with three case studies that demonstrate its broad applicability in meeting performance, reliability, quality, security, and energy goals despite environmental or requirements changes.

Why do developers use trivial packages? an empirical case study on npm

Code reuse is traditionally seen as good practice. Recent trends have pushed the concept of code reuse to an extreme, by using packages that implement simple and trivial tasks, which we call `trivial packages'. A recent incident where a trivial package led to the breakdown of some of the most popular web applications such as Facebook and Netflix made it imperative to question the growing use of trivial packages.

Therefore, in this paper, we mine more than 230,000 npm packages and 38,000 JavaScript applications in order to study the prevalence of trivial packages. We found that trivial packages are common and are increasing in popularity, making up 16.8% of the studied npm packages. We performed a survey with 88 Node.js developers who use trivial packages to understand the reasons and drawbacks of their use. Our survey revealed that trivial packages are used because they are perceived to be well implemented and tested pieces of code. However, developers are concerned about maintaining and the risks of breakages due to the extra dependencies trivial packages introduce. To objectively verify the survey results, we empirically validate the most cited reason and drawback and find that, contrary to developers' beliefs, only 45.2% of trivial packages even have tests. However, trivial packages appear to be `deployment tested' and to have similar test, usage and community interest as non-trivial packages. On the other hand, we found that 11.5% of the studied trivial packages have more than 20 dependencies. Hence, developers should be careful about which trivial packages they decide to use.

Detecting missing information in bug descriptions

Bug reports document unexpected software behaviors experienced by users. To be effective, they should allow bug triagers to easily understand and reproduce the potential reported bugs, by clearly describing the Observed Behavior (OB), the Steps to Reproduce (S2R), and the Expected Behavior (EB). Unfortunately, while considered extremely useful, reporters often miss such pieces of information in bug reports and, to date, there is no effective way to automatically check and enforce their presence. We manually analyzed nearly 3k bug reports to understand to what extent OB, EB, and S2R are reported in bug reports and what discourse patterns reporters use to describe such information. We found that (i) while most reports contain OB (i.e., 93.5%), only 35.2% and 51.4% explicitly describe EB and S2R, respectively; and (ii) reporters recurrently use 154 discourse patterns to describe such content. Based on these findings, we designed and evaluated an automated approach to detect the absence (or presence) of EB and S2R in bug descriptions. With its best setting, our approach is able to detect missing EB (S2R) with 85.9% (69.2%) average precision and 93.2% (83%) average recall. Our approach intends to improve bug descriptions quality by alerting reporters about missing EB and S2R at reporting time.

Continuous variable-specific resolutions of feature interactions

Systems that are assembled from independently developed features suffer from feature interactions, in which features affect one another's behaviour in surprising ways. The Feature Interaction Problem results from trying to implement an appropriate resolution for each interaction within each possible context, because the number of possible contexts to consider increases exponentially with the number of features in the system. Resolution strategies aim to combat the Feature Interaction Problem by offering default strategies that resolve entire classes of interactions, thereby reducing the work needed to resolve lots of interactions. However most such approaches employ coarse-grained resolution strategies (e.g., feature priority) or a centralized arbitrator.

Our work focuses on employing variable-specific default-resolution strategies that aim to resolve at runtime features- conflicting actions on a system's outputs. In this paper, we extend prior work to enable co-resolution of interactions on coupled output variables and to promote smooth continuous resolutions over execution paths. We implemented our approach within the PreScan simulator and performed a case study involving 15 automotive features; this entailed our devising and implementing three resolution strategies for three output variables. The results of the case study show that the approach produces smooth and continuous resolutions of interactions throughout interesting scenarios.

Model-level, platform-independent debugging in the context of the model-driven development of real-time systems

Providing proper support for debugging models at model-level is one of the main barriers to a broader adoption of Model Driven Development (MDD). In this paper, we focus on the use of MDD for the development of real-time embedded systems (RTE). We introduce a new platform-independent approach to implement model-level debuggers. We describe how to realize support for model-level debugging entirely in terms of the modeling language and show how to implement this support in terms of a model-to-model transformation. Key advantages of the approach over existing work are that (1) it does not require a program debugger for the code generated from the model, and that (2) any changes to, e.g., the code generator, the target language, or the hardware platform leave the debugger completely unaffected. We also describe an implementation of the approach in the context of Papyrus-RT, an open source MDD tool based on the modeling language UML-RT. We summarize the results of the use of our model-based debugger on several use cases to determine its overhead in terms of size and performance. Despite being a prototype, the performance overhead is in the order of microseconds, while the size overhead is comparable with that of GDB, the GNU Debugger.

Cooperative kernels: GPU multitasking for blocking algorithms

There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploiting scheduling quirks of today's GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative kernels, an extension to the traditional GPU programming model geared towards writing blocking algorithms. Workgroups of a cooperative kernel are fairly scheduled, and multitasking is supported via a small set of language extensions through which the kernel and scheduler cooperate. We describe a prototype implementation of a cooperative kernel framework implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking. Our prototype exploits no vendor-specific hardware, driver or compiler support, thus our results provide a lower-bound on the efficiency with which cooperative kernels can be implemented in practice.

Toward full elasticity in distributed static analysis: the case of callgraph analysis

In this paper we present the design and implementation of a distributed, whole-program static analysis framework that is designed to scale with the size of the input. Our approach is based on the actor programming model and is deployed in the cloud. Our reliance on a cloud cluster provides a degree of elasticity for CPU, memory, and storage resources. To demonstrate the potential of our technique, we show how a typical call graph analysis can be implemented in a distributed setting. The vision that motivates this work is that every large-scale software repository such as GitHub, BitBucket, or Visual Studio Online will be able to perform static analysis on a large scale. We experimentally validate our implementation of the distributed call graph analysis using a combination of both synthetic and real benchmarks. To show scalability, we demonstrate how the analysis presented in this paper is able to handle inputs that are almost 10 million lines of code (LOC) in size, without running out of memory. Our results show that the analysis scales well in terms of memory pressure independently of the input size, as we add more virtual machines (VMs). As the number of worker VMs increases, we observe that the analysis time generally improves as well. Lastly, we demonstrate that querying the results can be performed with a median latency of 15 ms.

Probabilistic model checking of perturbed MDPs with applications to cloud computing

Probabilistic model checking is a formal verification technique that has been applied successfully in a variety of domains, providing identification of system errors through quantitative verification of stochastic system models. One domain that can benefit from probabilistic model checking is cloud computing, which must provide highly reliable and secure computational and storage services to large numbers of mission-critical software systems.

For real-world domains like cloud computing, external system factors and environmental changes must be estimated accurately in the form of probabilities in system models; inaccurate estimates for the model probabilities can lead to invalid verification results. To address the effects of uncertainty in probability estimates, in previous work we have developed a variety of techniques for perturbation analysis of discrete- and continuous-time Markov chains (DTMCs and CTMCs). These techniques determine the consequences of the uncertainty on verification of system properties. In this paper, we present the first approach for perturbation analysis of Markov decision processes (MDPs), a stochastic formalism that is especially popular due to the significant expressive power it provides through the combination of both probabilistic and nondeterministic choice.

Our primary contribution is a novel technique for efficiently analyzing the effects of perturbations of model probabilities on verification of reachability properties of MDPs. The technique heuristically explores the space of adversaries of an MDP, which encode the different ways of resolving the MDP's nondeterministic choices. We demonstrate the practical effectiveness of our approach by applying it to two case studies of cloud systems.

Understanding the impact of refactoring on smells: a longitudinal study of 23 software projects

Code smells in a program represent indications of structural quality problems, which can be addressed by software refactoring. However, refactoring intends to achieve different goals in practice, and its application may not reduce smelly structures. Developers may neglect or end up creating new code smells through refactoring. Unfortunately, little has been reported about the beneficial and harmful effects of refactoring on code smells. This paper reports a longitudinal study intended to address this gap. We analyze how often commonly-used refactoring types affect the density of 13 types of code smells along the version histories of 23 projects. Our findings are based on the analysis of 16,566 refactorings distributed in 10 different types. Even though 79.4% of the refactorings touched smelly elements, 57% did not reduce their occurrences. Surprisingly, only 9.7% of refactorings removed smells, while 33.3% induced the introduction of new ones. More than 95% of such refactoring-induced smells were not removed in successive commits, which suggest refactorings tend to more frequently introduce long-living smells instead of eliminating existing ones. We also characterized and quantified typical refactoring-smell patterns, and observed that harmful patterns are frequent, including: (i) approximately 30% of the Move Method and Pull Up Method refactorings induced the emergence of God Class, and (ii) the Extract Superclass refactoring creates the smell Speculative Generality in 68% of the cases.

Cimplifier: automatically debloating containers

Application containers, such as those provided by Docker, have recently gained popularity as a solution for agile and seamless software deployment. These light-weight virtualization environments run applications that are packed together with their resources and configuration information, and thus can be deployed across various software platforms. Unfortunately, the ease with which containers can be created is oftentimes a double-edged sword, encouraging the packaging of logically distinct applications, and the inclusion of significant amount of unnecessary components, within a single container. These practices needlessly increase the container size-sometimes by orders of magnitude. They also decrease the overall security, as each included component-necessary or not-may bring in security issues of its own, and there is no isolation between multiple applications packaged within the same container image. We propose algorithms and a tool called Cimplifier, which address these concerns: given a container and simple user-defined constraints, our tool partitions it into simpler containers, which (i) are isolated from each other, only communicating as necessary, and (ii) only include enough resources to perform their functionality. Our evaluation on real-world containers demonstrates that Cimplifier preserves the original functionality, leads to reduction in image size of up to 95%, and processes even large containers in under thirty seconds.

Craig vs. Newton in software model checking

Ever since the seminal work on SLAM and BLAST, software model checking with counterexample-guided abstraction refinement (CEGAR) has been an active topic of research. The crucial procedure here is to analyze a sequence of program statements (the counterexample) to find building blocks for the overall proof of the program. We can distinguish two approaches (which we name Craig and Newton) to implement the procedure. The historically first approach, Newton (named after the tool from the SLAM toolkit), is based on symbolic execution. The second approach, Craig, is based on Craig interpolation. It was widely believed that Craig is substantially more effective than Newton. In fact, 12 out of the 15 CEGAR-based tools in SV-COMP are based on Craig. Advances in software model checkers based on Craig, however, can go only lockstep with advances in SMT solvers with Craig interpolation. It may be time to revisit Newton and ask whether Newton can be as effective as Craig. We have implemented a total of 11 variants of Craig and Newton in two different state-of-the-art software model checking tools and present the outcome of our experimental comparison.

Fairness testing: testing software for discrimination

This paper defines software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates, focusing on causality in discriminatory behavior. Evidence of software discrimination has been found in modern software systems that recommend criminal sentences, grant access to financial products, and determine who is allowed to participate in promotions. Our approach, Themis, generates efficient test suites to measure discrimination. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination.

The care and feeding of wild-caught mutants

Mutation testing of a test suite and a program provides a way to measure the quality of the test suite. In essence, mutation testing is a form of sensitivity testing: by running mutated versions of the program against the test suite, mutation testing measures the suite's sensitivity for detecting bugs that a programmer might introduce into the program. This paper introduces a technique to improve mutation testing that we call wild-caught mutants; it provides a method for creating potential faults that are more closely coupled with changes made by actual programmers. This technique allows the mutation tester to have more certainty that the test suite is sensitive to the kind of changes that have been observed to have been made by programmers in real-world cases.

QTEP: quality-aware test case prioritization

Test case prioritization (TCP) is a practical activity in software testing for exposing faults earlier. Researchers have proposed many TCP techniques to reorder test cases. Among them, coverage-based TCPs have been widely investigated. Specifically, coverage-based TCP approaches leverage coverage information between source code and test cases, i.e., static code coverage and dynamic code coverage, to schedule test cases. Existing coverage-based TCP techniques mainly focus on maximizing coverage while often do not consider the likely distribution of faults in source code. However, software faults are not often equally distributed in source code, e.g., around 80% faults are located in about 20% source code. Intuitively, test cases that cover the faulty source code should have higher priorities, since they are more likely to find faults.

In this paper, we present a quality-aware test case prioritization technique, QTEP, to address the limitation of existing coverage-based TCP algorithms. In QTEP, we leverage code inspection techniques, i.e., a typical statistic defect prediction model and a typical static bug finder, to detect fault-prone source code and then adapt existing coverage-based TCP algorithms by considering the weighted source code in terms of fault-proneness. Our evaluation with 16 variant QTEP techniques on 33 different versions of 7 open source Java projects shows that QTEP could improve existing coverage-based TCP techniques for both regression and new test cases. Specifically, the improvement of the best variant of QTEP for regression test cases could be up to 15.0% and on average 7.6%, and for all test cases (both regression and new test cases), the improvement could be up to 10.0% and on average 5.0%.

Constraint normalization and parameterized caching for quantitative program analysis

Symbolic program analysis techniques rely on satisfiability-checking constraint solvers, while quantitative program analysis techniques rely on model-counting constraint solvers. Hence, the efficiency of satisfiability checking and model counting is crucial for efficiency of modern program analysis techniques. In this paper, we present a constraint caching framework to expedite potentially expensive satisfiability and model-counting queries. Integral to this framework is our new constraint normalization procedure under which the cardinality of the solution set of a constraint, but not necessarily the solution set itself, is preserved. We extend these constraint normalization techniques to string constraints in order to support analysis of string-manipulating code. A group-theoretic framework which generalizes earlier results on constraint normalization is used to express our normalization techniques. We also present a parameterized caching approach where, in addition to storing the result of a model-counting query, we also store a model-counter object in the constraint store that allows us to efficiently recount the number of satisfying models for different maximum bounds. We implement our caching framework in our tool Cashew, which is built as an extension of the Green caching framework, and integrate it with the symbolic execution tool Symbolic PathFinder (SPF) and the model-counting constraint solver ABC. Our experiments show that constraint caching can significantly improve the performance of symbolic and quantitative program analyses. For instance, Cashew can normalize the 10,104 unique constraints in the SMC/Kaluza benchmark down to 394 normal forms, achieve a 10x speedup on the SMC/Kaluza-Big dataset, and an average 3x speedup in our SPF-based side-channel analysis experiments.

Generalized observational slicing for tree-represented modelling languages

Model-driven software engineering raises the abstraction level making complex systems easier to understand than if written in textual code. Nevertheless, large complicated software systems can have large models, motivating the need for slicing techniques that reduce the size of a model. We present a generalization of observation-based slicing that allows the criterion to be defined using a variety of kinds of observable behavior and does not require any complex dependence analysis. We apply our implementation of generalized observational slicing for tree-structured representations to Simulink models. The resulting slice might be the subset of the original model responsible for an observed failure or simply the sub-model semantically related to a classic slicing criterion. Unlike its predecessors, the algorithm is also capable of slicing embedded Stateflow state machines. A study of nine real-world models drawn from four different application domains demonstrates the effectiveness of our approach at dramatically reducing Simulink model sizes for realistic observation scenarios: for 9 out of 20 cases, the resulting model has fewer than 25% of the original model's elements.

On evidence preservation requirements for forensic-ready systems

Forensic readiness denotes the capability of a system to support digital forensic investigations of potential, known incidents by preserving in advance data that could serve as evidence explaining how an incident occurred. Given the increasing rate at which (potentially criminal) incidents occur, designing so‰ware systems that are forensic-ready can facilitate and reduce the costs of digital forensic investigations. However, to date, little or no attention has been given to how forensic-ready so‰ftware systems can be designed systematically. In this paper we propose to explicitly represent evidence preservation requirements prescribing preservation of the minimal amount of data that would be relevant to a future digital investigation. We formalise evidence preservation requirements and propose an approach for synthesising specifications for systems to meet these requirements. We present our prototype implementation—based on a satisfiability solver and a logic-based learner—which we use to evaluate our approach, applying it to two digital forensic corpora. Our evaluation suggests that our approach preserves relevant data that could support hypotheses of potential incidents. Moreover, it enables significant reduction in the volume of data that would need to be examined during an investigation.

BDCI: behavioral driven conflict identification

Source Code Management (SCM) systems support software evolution by providing features, such as version control, branching, and conflict detection. Despite the presence of these features, support to parallel software development is often limited. SCM systems can only address a subset of the conflicts that might be introduced by developers when concurrently working on multiple parallel branches. In fact, SCM systems can detect textual conflicts, which are generated by the concurrent modification of the same program locations, but they are unable to detect higher-order conflicts, which are generated by the concurrent modification of different program locations that generate program misbehaviors once merged. Higher-order conflicts are painful to detect and expensive to fix because they might be originated by the interference of apparently unrelated changes.

In this paper we present Behavioral Driven Conflict Identification (BDCI), a novel approach to conflict detection. BDCI moves the analysis of conflicts from the source code level to the level of program behavior by generating and comparing behavioral models. The analysis based on behavioral models can reveal interfering changes as soon as they are introduced in the SCM system, even if they do not introduce any textual conflict.

To evaluate the effectiveness and the cost of the proposed approach, we developed BDCIf, a specific instance of BDCI dedicated to the detection of higher-order conflicts related to the functional behavior of a program. The evidence collected by analyzing multiple versions of Git and Redis suggests that BDCIf can effectively detect higher-order conflicts and report how changes might interfere.

NoFAQ: synthesizing command repairs from examples

Command-line tools are confusing and hard to use due to their cryptic error messages and lack of documentation. Novice users often resort to online help-forums for finding corrections to their buggy commands, but have a hard time in searching precisely for posts that are relevant to their problem and then applying the suggested solutions to their buggy command. We present NoFAQ, a tool that uses a set of rules to suggest possible fixes when users write buggy commands that trigger commonly occurring errors. The rules are expressed in a language called FIXIT and each rule pattern-matches against the user's buggy command and corresponding error message, and uses these inputs to produce a possible fixed command. NoFAQ automatically learns FIXIT rules from examples of buggy and repaired commands. We evaluate NoFAQ on two fronts. First, we use 92 benchmark problems drawn from an existing tool and show that NoFAQ is able to synthesize rules for 81 benchmark problems in real time using just 2 to 5 input-output examples for each rule. Second, we run our learning algorithm on the examples obtained through a crowd-sourcing interface and show that the learning algorithm scales to large sets of examples.

S3: syntax- and semantic-guided repair synthesis via programming by examples

A notable class of techniques for automatic program repair is known as semantics-based. Such techniques, e.g., Angelix, infer semantic specifications via symbolic execution, and then use program synthesis to construct new code that satisfies those inferred specifications. However, the obtained specifications are naturally incomplete, leaving the synthesis engine with a difficult task of synthesizing a general solution from a sparse space of many possible solutions that are consistent with the provided specifications but that do not necessarily generalize. We present S3, a new repair synthesis engine that leverages programming-by-examples methodology to synthesize high-quality bug repairs. The novelty in S3 that allows it to tackle the sparse search space to create more general repairs is three-fold: (1) A systematic way to customize and constrain the syntactic search space via a domain-specific language, (2) An efficient enumeration- based search strategy over the constrained search space, and (3) A number of ranking features based on measures of the syntactic and semantic distances between candidate solutions and the original buggy program. We compare S3’s repair effectiveness with state-of-the-art synthesis engines Angelix, Enumerative, and CVC4. S3 can successfully and correctly fix at least three times more bugs than the best baseline on datasets of 52 bugs in small programs, and 100 bugs in real-world large programs.

Counterexample-guided approach to finding numerical invariants

Numerical invariants, e.g., relationships among numerical variables in a program, represent a useful class of properties to analyze programs. General polynomial invariants represent more complex numerical relations, but they are often required in many scientific and engineering applications. We present NumInv, a tool that implements a counterexample-guided invariant generation (CEGIR) technique to automatically discover numerical invariants, which are polynomial equality and inequality relations among numerical variables. This CEGIR technique infers candidate invariants from program traces and then checks them against the program source code using the KLEE test-input generation tool. If the invariants are incorrect KLEE returns counterexample traces, which help the dynamic inference obtain better results. Existing CEGIR approaches often require sound invariants, however NumInv sacrifices soundness and produces results that KLEE cannot refute within certain time bounds. This design and the use of KLEE as a verifier allow NumInv to discover useful and important numerical invariants for many challenging programs.

Preliminary results show that NumInv generates required invariants for understanding and verifying correctness of programs involving complex arithmetic. We also show that NumInv discovers polynomial invariants that capture precise complexity bounds of programs used to benchmark existing static complexity analysis techniques. Finally, we show that NumInv performs competitively comparing to state of the art numerical invariant analysis tools.

Discovering relational specifications

Formal specifications of library functions play a critical role in a number of program analysis and development tasks. We present Bach, a technique for discovering likely relational specifications from data describing input-output behavior of a set of functions comprising a library or a program. Relational specifications correlate different executions of different functions; for instance, commutativity, transitivity, equivalence of two functions, etc. Bach combines novel insights from program synthesis and databases to discover a rich array of specifications. We apply Bach to learn specifications from data generated for a number of standard libraries. Our experimental evaluation demonstrates Bach's ability to learn useful and deep specifications in a small amount of time.

Steelix: program-state based binary fuzzing

Coverage-based fuzzing is one of the most effective techniques to find vulnerabilities, bugs or crashes. However, existing techniques suffer from the difficulty in exercising the paths that are protected by magic bytes comparisons (e.g., string equality comparisons). Several approaches have been proposed to use heavy-weight program analysis to break through magic bytes comparisons, and hence are less scalable. In this paper, we propose a program-state based binary fuzzing approach, named Steelix, which improves the penetration power of a fuzzer at the cost of an acceptable slow down of the execution speed. In particular, we use light-weight static analysis and binary instrumentation to provide not only coverage information but also comparison progress information to a fuzzer. Such program state information informs a fuzzer about where the magic bytes are located in the test input and how to perform mutations to match the magic bytes efficiently. We have implemented Steelix and evaluated it on three datasets: LAVA-M dataset, DARPA CGC sample binaries and five real-life programs. The results show that Steelix has better code coverage and bug detection capability than the state-of-the-art fuzzers. Moreover, we found one CVE and nine new bugs.

CodeMatch: obfuscation won't conceal your repackaged app

An established way to steal the income of app developers, or to trick users into installing malware, is the creation of repackaged apps. These are clones of - typically - successful apps. To conceal their nature, they are often obfuscated by their creators. But, given that it is a common best practice to obfuscate apps, a trivial identification of repackaged apps is not possible. The problem is further intensified by the prevalent usage of libraries. In many apps, the size of the overall code base is basically determined by the used libraries. Therefore, two apps, where the obfuscated code bases are very similar, do not have to be repackages of each other. To reliably detect repackaged apps, we propose a two step approach which first focuses on the identification and removal of the library code in obfuscated apps. This approach - LibDetect - relies on code representations which abstract over several parts of the underlying bytecode to be resilient against certain obfuscation techniques. Using this approach, we are able to identify on average 70% more used libraries per app than previous approaches. After the removal of an app's library code, we then fuzzy hash the most abstract representation of the remaining app code to ensure that we can identify repackaged apps even if very advanced obfuscation techniques are used. This makes it possible to identify repackaged apps. Using our approach, we found that ≈ 15% of all apps in Android app stores are repackages

A compiler and verifier for page access oblivious computation

Trusted hardware primitives such as Intel's SGX instructions provide applications with a protected address space, called an enclave, for trusted code and data. However, building enclaves that preserve confidentiality of sensitive data continues to be a challenge. The developer must not only avoid leaking secrets via the enclave's outputs but also prevent leaks via side channels induced by interactions with the untrusted platform. Recent attacks have demonstrated that simply observing the page faults incurred during an enclave's execution can reveal its secrets if the enclave makes data accesses or control flow decisions based on secret values. To address this problem, a developer needs compilers to automatically produce confidential programs, and verification tools to certify the absence of secret-dependent page access patterns (a property that we formalize as page-access obliviousness). To that end, we implement an efficient compiler for a type and memory-safe language, a compiler pass that enforces page-access obliviousness with low runtime overheads, and an automatic, modular verifier that certifies page-access obliviousness at the machine-code level, thus removing the compiler from our trusted computing base. We evaluate this toolchain on several machine learning algorithms and image processing routines that we run within SGX enclaves.

Automatic generation of inter-component communication exploits for Android applications

Although a wide variety of approaches identify vulnerabilities in Android apps, none attempt to determine exploitability of those vulnerabilities. Exploitability can aid in reducing false positives of vulnerability analysis, and can help engineers triage bugs. Specifically, one of the main attack vectors of Android apps is their inter-component communication interface, where apps may receive messages called Intents. In this paper, we provide the first approach for automatically generating exploits for Android apps, called LetterBomb, relying on a combined path-sensitive symbolic execution-based static analysis, and the use of software instrumentation and test oracles. We run LetterBomb on 10,000 Android apps from Google Play, where we identify 181 exploits from 835 vulnerable apps. Compared to a state-of-the-art detection approach for three ICC-based vulnerabilities, LetterBomb obtains 33%-60% more vulnerabilities at a 6.66 to 7 times faster speed.

OASIS: prioritizing static analysis warnings for Android apps based on app user reviews

Lint is a widely-used static analyzer for detecting bugs/issues in Android apps. However, it can generate many false warnings. One existing solution to this problem is to leverage project history data (e.g., bug fixing statistics) for warning prioritization. Unfortunately, such techniques are biased toward a project’s archived warnings and can easily miss newissues. Anotherweakness is that developers cannot readily relate the warnings to the impacts perceivable by users. To overcome these weaknesses, in this paper, we propose a semantics-aware approach, OASIS, to prioritizing Lint warnings by leveraging app user reviews. OASIS combines program analysis and NLP techniques to recover the intrinsic links between the Lint warnings for a given app and the user complaints on the app problems caused by the issues of concern. OASIS leverages the strength of such links to prioritize warnings. We evaluated OASIS on six popular and large-scale open-source Android apps. The results show that OASIS can effectively prioritize Lint warnings and help identify new issues that are previously-unknown to app developers.

Recovering clear, natural identifiers from obfuscated JS names

Well-chosen variable names are critical to source code readability, reusability, and maintainability. Unfortunately, in deployed JavaScript code (which is ubiquitous on the web) the identifier names are frequently minified and overloaded. This is done both for efficiency and also to protect potentially proprietary intellectual property. In this paper, we describe an approach based on statistical machine translation (SMT) that recovers some of the original names from the JavaScript programs minified by the very popular UglifyJS. This simple tool, Autonym, performs comparably to the best currently available deobfuscator for JavaScript, JSNice, which uses sophisticated static analysis. In fact, Autonym is quite complementary to JSNice, performing well when it does not, and vice versa. We also introduce a new tool, JSNaughty, which blends Autonym and JSNice, and significantly outperforms both at identifier name recovery, while remaining just as easy to use as JSNice. JSNaughty is available online at

DESCRY: reproducing system-level concurrency failures

Concurrent systems may fail in the field due to various elusive faults such as race conditions. Reproducing such failures is hard because (1) concurrency failures at the system level often involve multiple processes or event handlers (e.g., software signals), which cannot be handled by existing tools for reproducing intra-process (thread-level) failures; (2) detailed field data, such as user input, file content and interleaving schedule, may not be available to developers; and (3) the debugging environment may differ from the deployed environment, which further complicates failure reproduction. To address these problems, we present DESCRY, the first fully automated tool for reproducing system-level concurrency failures based only on default log messages collected from the field. DESCRY uses a combination of static and dynamic analysis techniques, together with symbolic execution, to synthesize both the failure-inducing data input and the interleaving schedule, and leverages them to deterministically replay the failed execution using existing virtual platforms. We have evaluated DESCRY on 22 real-world multi-process Linux applications with a total of 236,875 lines of code to demonstrate both its effectiveness and its efficiency in reproducing failures that no other tool can reproduce.

Reproducing concurrency failures from crash stacks

Reproducing field failures is the first essential step for understanding, localizing and removing faults. Reproducing concurrency field failures is hard due to the need of synthesizing a test code jointly with a thread interleaving that induce the failure in the presence of limited information from the field. Current techniques for reproducing concurrency failures focus on identifying failure-inducing interleavings, leaving largely open the problem of synthesizing the test code that manifests such interleavings. In this paper, we present ConCrash, a technique to automatically generate test codes that reproduce concurrency failures that violate thread-safety from crash stacks, which commonly summarize the conditions of field failures. ConCrash efficiently explores the huge space of possible test codes to identify a failure-inducing one by using a suitable set of search pruning strategies. Combined with existing techniques for exploring interleavings, ConCrash automatically reproduces a given concurrency failure that violates the thread-safety of a class by identifying both a failure-inducing test code and corresponding interleaving. In the paper, we define the ConCrash approach, present a prototype implementation of ConCrash, and discuss the experimental results that we obtained on a known set of ten field failures that witness the effectiveness of the approach.

Automatically analyzing groups of crashes for finding correlations

We devised an algorithm, inspired by contrast-set mining algorithms such as STUCCO, to automatically find statistically significant properties (correlations) in crash groups. Many earlier works focused on improving the clustering of crashes but, to the best of our knowledge, the problem of automatically describing properties of a cluster of crashes is so far unexplored. This means developers currently spend a fair amount of time analyzing the groups themselves, which in turn means that a) they are not spending their time actually developing a fix for the crash; and b) they might miss something in their exploration of the crash data (there is a large number of attributes in crash reports and it is hard and error-prone to manually analyze everything). Our algorithm helps developers and release managers understand crash reports more easily and in an automated way, helping in pinpointing the root cause of the crash. The tool implementing the algorithm has been deployed on Mozilla's crash reporting service.

Automatic inference of code transforms for patch generation

We present a new system, Genesis, that processes human patches to automatically infer code transforms for automatic patch generation. We present results that characterize the effectiveness of the Genesis inference algorithms and the complete Genesis patch generation system working with real-world patches and defects collected from 372 Java projects. To the best of our knowledge, Genesis is the first system to automatically infer patch generation transforms or candidate patch search spaces from previous successful patches.

A feasibility study of using automated program repair for introductory programming assignments

Despite the fact an intelligent tutoring system for programming (ITSP) education has long attracted interest, its widespread use has been hindered by the difficulty of generating personalized feedback automatically. Meanwhile, automated program repair (APR) is an emerging new technology that automatically fixes software bugs, and it has been shown that APR can fix the bugs of large real-world software. In this paper, we study the feasibility of marrying intelligent programming tutoring and APR. We perform our feasibility study with four state-of-the-art APR tools (GenProg, AE, Angelix, and Prophet), and 661 programs written by the students taking an introductory programming course. We found that when APR tools are used out of the box, only about 30% of the programs in our dataset are repaired. This low repair rate is largely due to the student programs often being significantly incorrect - in contrast, professional software for which APR was successfully applied typically fails only a small portion of tests. To bridge this gap, we adopt in APR a new repair policy akin to the hint generation policy employed in the existing ITSP. This new repair policy admits partial repairs that address part of failing tests, which results in 84% improvement of repair rate. We also performed a user study with 263 novice students and 37 graders, and identified an understudied problem; while novice students do not seem to know how to effectively make use of generated repairs as hints, the graders do seem to gain benefits from repairs.

Automatically diagnosing and repairing error handling bugs in C

Correct error handling is essential for building reliable and secure systems. Unfortunately, low-level languages like C often do not support any error handling primitives and leave it up to the developers to create their own mechanisms for error propagation and handling. However, in practice, the developers often make mistakes while writing the repetitive and tedious error handling code and inadvertently introduce bugs. Such error handling bugs often have severe consequences undermining the security and reliability of the affected systems. Fixing these bugs is also tiring-they are repetitive and cumbersome to implement. Therefore, it is crucial to develop tool supports for automatically detecting and fixing error handling bugs.

To understand the nature of error handling bugs that occur in widely used C programs, we conduct a comprehensive study of real world error handling bugs and their fixes. Leveraging the knowledge, we then design, implement, and evaluate ErrDoc, a tool that not only detects and characterizes different types of error handling bugs but also automatically fixes them. Our evaluation on five open-source projects shows that ErrDoc can detect error handling bugs with 100% to 84% precision and around 95% recall, and categorize them with 83% to 96% precision and above 90% recall. Thus, ErrDoc improves precision up to 5 percentage points, and recall up to 44 percentage points w.r.t. the state-of-the-art. We also demonstrate that ErrDoc can fix the bugs with high accuracy.

Are deep neural networks the best choice for modeling source code?

Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source code. We argue here that the special properties of source code can be exploited for further improvements. In this work, we enhance established language modeling approaches to handle the special challenges of modeling source code, such as: frequent changes, larger, changing vocabularies, deeply nested scopes, etc. We present a fast, nested language modeling toolkit specifically designed for software, with the ability to add & remove text, and mix & swap out many models. Specifically, we improve upon prior cache-modeling work and present a model with a much more expansive, multi-level notion of locality that we show to be well-suited for modeling software. We present results on varying corpora in comparison with traditional N-gram, as well as RNN, and LSTM deep-learning language models, and release all our source code for public use. Our evaluations suggest that carefully adapting N-gram models for source code can yield performance that surpasses even RNN and LSTM based deep-learning models.

Understanding the impact of support for iteration on code search

Sometimes, when programmers use a search engine they know more or less what they need. Other times, programmers use the search engine to look around and generate possible ideas for the programming problem they are working on. The key insight we explore in this paper is that the results found in the latter case tend to serve as inspiration or triggers for the next queries issued. We introduce two search engines, CodeExchange and CodeLikeThis, both of which are specifically designed to enable the user to directly leverage the results in formulating the next query. CodeExchange does this with a set of four features supporting the programmer to use characteristics of the results to find other code with or without those characteristics. CodeLikeThis supports simply selecting an entire result to find code that is analogous, to some degree, to that result. We evaluated how these approaches were used along with two approaches not explicitly supporting iteration, a baseline and Google, in a user study among 24 developers. We find that search engines that support using results to form the next query can improve the programmers’ search experience and different approaches to iteration can provide better experiences depending on the task.

LAMP: data provenance for graph based machine learning algorithms through derivative computation

Data provenance tracking determines the set of inputs related to a given output. It enables quality control and problem diagnosis in data engineering. Most existing techniques work by tracking program dependencies. They cannot quantitatively assess the importance of related inputs, which is critical to machine learning algorithms, in which an output tends to depend on a huge set of inputs while only some of them are of importance. In this paper, we propose LAMP, a provenance computation system for machine learning algorithms. Inspired by automatic differentiation (AD), LAMP quantifies the importance of an input for an output by computing the partial derivative. LAMP separates the original data processing and the more expensive derivative computation to different processes to achieve cost-effectiveness. In addition, it allows quantifying importance for inputs related to discrete behavior, such as control flow selection. The evaluation on a set of real world programs and data sets illustrates that LAMP produces more precise and succinct provenance than program dependence based techniques, with much less overhead. Our case studies demonstrate the potential of LAMP in problem diagnosis in data engineering.

More accurate recommendations for method-level changes

During the life span of large software projects, developers often apply the same code changes to different code locations in slight variations. Since the application of these changes to all locations is time-consuming and error-prone, tools exist that learn change patterns from input examples, search for possible pattern applications, and generate corresponding recommendations. In many cases, the generated recommendations are syntactically or semantically wrong due to code movements in the input examples. Thus, they are of low accuracy and developers cannot directly copy them into their projects without adjustments.

We present the Accurate REcommendation System (ARES) that achieves a higher accuracy than other tools because its algorithms take care of code movements when creating patterns and recommendations. On average, the recommendations by ARES have an accuracy of 96% with respect to code changes that developers have manually performed in commits of source code archives. At the same time ARES achieves precision and recall values that are on par with other tools.

Regression test selection across JVM boundaries

Modern software development processes recommend that changes be integrated into the main development line of a project multiple times a day. Before a new revision may be integrated, developers practice regression testing to ensure that the latest changes do not break any previously established functionality. The cost of regression testing is high, due to an increase in the number of revisions that are introduced per day, as well as the number of tests developers write per revision. Regression test selection (RTS) optimizes regression testing by skipping tests that are not affected by recent project changes. Existing dynamic RTS techniques support only projects written in a single programming language, which is unfortunate knowing that an open-source project is on average written in several programming languages.

We present the first dynamic RTS technique that does not stop at predefined language boundaries. Our technique dynamically detects, at the operating system level, all file artifacts a test depends on. Our technique is, hence, oblivious to the specific means the test uses to actually access the files: be it through spawning a new process, invoking a system call, invoking a library written in a different language, invoking a library that spawns a process which makes a system call, etc. We also provide a set of extension points which allow for a smooth integration with testing frameworks and build systems. We implemented our technique in a tool called RTSLinux as a loadable Linux kernel module and evaluated it on 21 Java projects that escape JVM by spawning new processes or invoking native code, totaling 2,050,791 lines of code. Our results show that RTSLinux, on average, skips 74.17% of tests and saves 52.83% of test execution time compared to executing all tests.

Measuring the cost of regression testing in practice: a study of Java projects using continuous integration

Software defects cost time and money to diagnose and fix. Consequently, developers use a variety of techniques to avoid introducing defects into their systems. However, these techniques have costs of their own; the benefit of using a technique must outweigh the cost of applying it.

In this paper we investigate the costs and benefits of automated regression testing in practice. Specifically, we studied 61 projects that use Travis CI, a cloud-based continuous integration tool, in order to examine real test failures that were encountered by the developers of those projects. We determined how the developers resolved the failures they encountered and used this information to classify the failures as being caused by a flaky test, by a bug in the system under test, or by a broken or obsolete test. We consider that test failures caused by bugs represent a benefit of the test suite, while failures caused by broken or obsolete tests represent a test suite maintenance cost.

We found that 18% of test suite executions fail and that 13% of these failures are flaky. Of the non-flaky failures, only 74% were caused by a bug in the system under test; the remaining 26% were due to incorrect or obsolete tests. In addition, we found that, in the failed builds, only 0.38% of the test case executions failed and 64% of failed builds contained more than one failed test.

Our findings contribute to a wider understanding of the unforeseen costs that can impact the overall cost effectiveness of regression testing in practice. They can also inform research into test case selection techniques, as we have provided an approximate empirical bound on the practical value that could be extracted from such techniques. This value appears to be large, as the 61 systems under study contained nearly 3 million lines of test code and yet over 99% of test case executions could have been eliminated with a perfect oracle.

Better test cases for better automated program repair

Automated generate-and-validate program repair techniques (G&V techniques) suffer from generating many overfitted patches due to in-capabilities of test cases. Such overfitted patches are incor- rect patches, which only make all given test cases pass, but fail to fix the bugs. In this work, we propose an overfitted patch detec- tion framework named Opad (Overfitted PAtch Detection). Opad helps improve G&V techniques by enhancing existing test cases to filter out overfitted patches. To enhance test cases, Opad uses fuzz testing to generate new test cases, and employs two test or- acles (crash and memory-safety) to enhance validity checking of automatically-generated patches. Opad also uses a novel metric (named O-measure) for deciding whether automatically-generated patches overfit. Evaluated on 45 bugs from 7 large systems (the same benchmark used by GenProg and SPR), Opad filters out 75.2% (321/427) over- fitted patches generated by GenProg/AE, Kali, and SPR. In addition, Opad guides SPR to generate correct patches for one more bug (the original SPR generates correct patches for 11 bugs). Our analysis also shows that up to 40% of such automatically-generated test cases may further improve G&V techniques if empowered with better test oracles (in addition to crash and memory-safety oracles employed by Opad).

SESSION: Testing and Security in the Real World

When program analysis meets mobile security: an industrial study of misusing Android internet sockets

Despite recent progress in program analysis techniques to identify vulnerabilities in Android apps, significant challenges still remain for applying these techniques to large-scale industrial environments. Modern software-security providers, such as Qihoo 360 and Pwnzen (two leading companies in China), are often required to process more than 10 million mobile apps at each run. In this work, we focus on effectively and efficiently identifying vulnerable usage of Internet sockets in an industrial setting. To achieve this goal, we propose a practical hybrid approach that enables lightweight yet precise detection in the industrial setting. In particular, we integrate the process of categorizing potential vulnerable apps with analysis techniques, to reduce the inevitable human inspection effort. We categorize potential vulnerable apps based on characteristics of vulnerability signatures, to reduce the burden on static analysis. We flexibly integrate static and dynamic analyses for apps in each identified family, to refine the family signatures and hence target on precise detection. We implement our approach in a practical system and deploy the system on the Pwnzen platform. By using the system, we identify and report potential vulnerabilities of 24 vulnerable apps (falling into 3 vulnerability families) to their developers, and some of these reported vulnerabilities are previously unknown. The apps of each vulnerability family in total have over 50 million downloads. We also propose countermeasures and highlight promising directions for technology transfer.

File-level vs. module-level regression test selection for .NET

Regression testing is used to check the correctness of evolving software. With the adoption of Agile development methodology, the number of tests and software revisions has dramatically increased, and hence has the cost of regression testing. Researchers proposed regression test selection (RTS) techniques that optimize regression testing by skipping tests that are not impacted by recent program changes. Ekstazi is one such state-of-the art technique; Ekstazi is implemented for the Java programming language and has been adopted by several companies and open-source projects.

We report on our experience implementing and evaluating Ekstazi#, an Ekstazi-like tool for .NET. We describe the key challenges of bringing the Ekstazi idea to the .NET platform. We evaluate Ekstazi# on 11 open-source projects, as well as an internal Microsoft project substantially larger than each of the open-source projects. Finally, we compare Ekstazi# to an incremental build system (also developed at Microsoft), which, out of the box, provides module-level dependency tracking and skipping tasks (including test execution) whenever dependencies of a task do not change between the current and the last successful build. Ekstazi# on average reduced regression testing time by 43.70% for the open-source projects and by 65.26% for the Microsoft project (the latter is in addition to the savings provided by incremental builds).

Record and replay for Android: are we there yet in industrial cases?

Mobile applications, or apps for short, are gaining popularity. The input sources (e.g., touchscreen, sensors, transmitters) of the smart devices that host these apps enable the apps to offer a rich experience to the users, but these input sources pose testing complications to the developers (e.g., writing tests to accurately utilize multiple input sources together and be able to replay such tests at a later time). To alleviate these complications, researchers and practitioners in recent years have developed a variety of record-and-replay tools to support the testing expressiveness of smart devices. These tools allow developers to easily record and automate the replay of complicated usage scenarios of their app. Due to Android's large share of the smart-device market, numerous record-and-replay tools have been developed using a variety of techniques to test Android apps. To better understand the strengths and weaknesses of these tools, we present a comparison of popular record-and-replay tools from researchers and practitioners, by applying these tools to test three popular industrial apps downloaded from the Google Play store. Our comparison is based on three main metrics: (1) ability to reproduce common usage scenarios, (2) space overhead of traces created by the tools, and (3) robustness of traces created by the tools (when being replayed on devices with different resolutions). The results from our comparison show which record-and-replay tools may be the best for developers and identify future directions for improving these tools to better address testing complications of smart devices.

Model-driven software engineering in practice: privacy-enhanced filtering of network traffic

Network traffic data contains a wealth of information for use in security analysis and application development. Unfortunately, it also usually contains confidential or otherwise sensitive information, prohibiting sharing and analysis. Existing automated anonymization solutions are hard to maintain and tend to be outdated.

We present Privacy-Enhanced Filtering (PEF), a model-driven prototype framework that relies on declarative descriptions of protocols and a set of filter rules, which are used to automatically transform network traffic data to remove sensitive information. This paper discusses the design, implementation and application of PEF, which is available as open-source software and configured for use in a typical malware detection scenario.

SESSION: The State of the Practice

Strong agile metrics: mining log data to determine predictive power of software metrics for continuous delivery teams

ING Bank, a large Netherlands-based internationally operating bank, implemented a fully automated continuous delivery pipe-line for its software engineering activities in more than 300 teams, that perform more than 2500 deployments to production each month on more than 750 different applications. Our objective is to examine how strong metrics for agile (Scrum) DevOps teams can be set in an iterative fashion. We perform an exploratory case study that focuses on the classification based on predictive power of software metrics, in which we analyze log data derived from two initial sources within this pipeline. We analyzed a subset of 16 metrics from 59 squads. We identified two lagging metrics and assessed four leading metrics to be strong.

Screening heuristics for project gating systems

Continuous Integration (CI) is a hot topic. Yet, little attention is payed to how CI systems work and what impacts their behavior. In parallel, bug prediction in software is gaining high attention. But this is done mostly in the context of software engineering, and the relation to the realm of CI and CI systems engineering has not been established yet. In this paper we describe how Project Gating systems operate, which are a specific type of CI systems used to keep the mainline of development always clean. We propose and evaluate three heuristics for improving Gating performance and demonstrate their trade-offs. The third heuristic, which leverages state-of-the-art bug prediction achieves the best performance across the entire spectrum of workload conditions.

Natural language querying in SAP-ERP platform

With the omnipresence of mobile devices coupled with recent advances in automatic speech recognition capabilities, there has been a growing demand for natural language query (NLQ) interface to retrieve information from the knowledge bases. Business users particularly find this useful as NLQ interface enables them to ask questions without the knowledge of the query language or the data schema. In this paper, we apply an existing research technology called ``ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores'' in the industry domain of SAP-ERP systems. The goal is to enable users to query SAP-ERP data using natural language. We present the challenges and their solutions of such a technology transfer. We present the effectiveness of the natural language query interface on a set of questions given by a set of SAP practitioners.

Serverless computing: economic and architectural impact

Amazon Web Services unveiled their "Lambda" platform in late 2014. Since then, each of the major cloud computing infrastructure providers has released services supporting a similar style of deployment and operation, where rather than deploying and running monolithic services, or dedicated virtual machines, users are able to deploy individual functions, and pay only for the time that their code is actually executing. These technologies are gathered together under the marketing term "serverless" and the providers suggest that they have the potential to significantly change how client/server applications are designed, developed and operated.

This paper presents two case industrial studies of early adopters, showing how migrating an application to the Lambda deployment architecture reduced hosting costs - by between 66% and 95% - and discusses how further adoption of this trend might influence common software architecture design practices.

SESSION: Understanding Software Developers

What do software engineers care about? gaps between research and practice

It is a cliche to say that there is a gap between research and practice. As the interest and importance in the practical impact of research has been growing, the gap between research and practice is expected to be narrowing. However, our study reveals that there still seems to be a wide gap. We survey so ware engineers about what they care about when developing so ware. We then compare our survey results with the research topics of the papers published in ICSE/FSE recently. We found the following discrepancy: while so ware engineers care more about so ware development productivity than the quality of so ware, papers on research areas closely related to so ware productivity--such as so ware development process management and so ware development techniques--are significantly less published than papers on so ware verification and validation that account for more than half of publications. We also found that so ware engineers are in great need for techniques for accurate effort estimation, and they are not necessarily knowledgable about techniques they can use to meet their needs.

Reference architectures and Scrum: friends or foes?

Software reference architectures provide templates and guidelines for designing systems in a particular domain. Companies use them to achieve interoperability of (parts of) their software, standardization, and faster development. In contrast to system-specific software architectures that "emerge" during development, reference architectures dictate significant parts of the software design early on. Agile software development frameworks (such as Scrum) acknowledge changing software requirements and the need to adapt the software design accordingly. In this paper, we present lessons learned about how reference architectures interact with Scrum (the most frequently used agile process framework). These lessons are based on observing software development projects in five companies. We found that reference architectures can support good practice in Scrum: They provide enough design upfront without too much effort, reduce documentation activities, facilitate knowledge sharing, and contribute to "architectural thinking" of developers. However, reference architectures can impose risks or even threats to the success of Scrum (e.g., to self-organizing and motivated teams).

Guidelines for adopting frontend architectures and patterns in microservices-based systems

Microservice-based systems enable the independent development, deployment, and scalability for separate system components of enterprise applications. A significant aspect during development is the microservice integration in frontends of web, mobile, and desktop applications. One challenge here is the selection of an adequate frontend architecture as well as suitable patterns that satisfy the application requirements. This paper analyses available strategies for organizing and implementing microservices frontends. These approaches are then evaluated based on a quality model and various prototypes of the same application implemented using the distinct approaches. The results of this analysis are generalized to a guideline that supports the selection of a suitable architecture.

Improving understanding of dynamically typed software developed by agile practitioners

Agile Development values working software over documentation. Therefore, in maintenance stages of existing software, the source code is the sole software artifact that developers have for analyzing the viability and impact of a new user story. Since functionality is often spread in hundreds of lines of code, it is hard for the developer to understand the system, which may lead to under-/overestimation of the new feature cost and rework/delays in the subsequent phases of development. In a previous work, we proposed a Model-Driven Reverse Engineering approach for obtaining software visualizations from source code. Two case studies of comprehension of applications written in statically typed languages have shown the applicability of this approach. A recent experience with an industrial partner, where the systems are developed on dynamically typed languages, has motivated us to adapt the previous proposal to take as input not only the source code but also the application data schema to complete the information that is missing in the code, and then automatically generate more meaningful diagrams that help developers in maintenance tasks. In this article, we present the adaptation of the general approach to support data schema as an additional input and its instrumentation in an industrial case study where the technology is Ruby on Rails. The paper ends by explaining the precision and performance of the instrumentation when used in a Colombian company as well as lessons learned.

SESSION: Data-Driven Improvement

Automated identification of security issues from commit messages and bug reports

The number of vulnerabilities in open source libraries is increasing rapidly. However, the majority of them do not go through public disclosure. These unidentified vulnerabilities put developers' products at risk of being hacked since they are increasingly relying on open source libraries to assemble and build software quickly. To find unidentified vulnerabilities in open source libraries and secure modern software development, we describe an efficient automatic vulnerability identification system geared towards tracking large-scale projects in real time using natural language processing and machine learning techniques. Built upon the latent information underlying commit messages and bug reports in open source projects using GitHub, JIRA, and Bugzilla, our K-fold stacking classifier achieves promising results on vulnerability identification. Compared to the state of the art SVM-based classifier in prior work on vulnerability identification in commit messages, we improve precision by 54.55% while maintaining the same recall rate. For bug reports, we achieve a much higher precision of 0.70 and recall rate of 0.71 compared to existing work. Moreover, observations from running the trained model at SourceClear in production for over 3 months has shown 0.83 precision, 0.74 recall rate, and detected 349 hidden vulnerabilities, proving the effectiveness and generality of the proposed approach.

LaChouTi: kernel vulnerability responding framework for the fragmented Android devices

The most criticized problem in the Android ecosystem is fragmentation, i.e., 24,093 Android devices in the wild are made by 1,294 manufacturers and installed with extremely customized operating systems. The existence of so many different active versions of Android makes security updates and vulnerability responses across the whole range of Android devices difficult. In this paper, we seek to respond to the unpatched kernel vulnerabilities for the fragmented Android devices. Specifically, we propose and implement LaChouTi, which is an automated kernel security update framework consisting of cloud service and end application update. LaChouTi first tracks and identifies the exposed vulnerabilities according to the CVE-Patch map for the target Android kernels. Then, it generates differential binary patches for the identified results. Finally, it pushes and applies the patches to the kernels. We evaluate LaChouTi using 12 Nexus Android devices that have different Android versions, different kernel versions, different series and different manufacturers, and find 1922 unpatched kernel vulnerabilities in these devices. The results show that: (1) the security risk of unpatched vulnerabilities caused by fragmentation is serious; and (2) the proposed LaChouTi is effective in responding to such security risk. Finally, we implement LaChouTi on new commercial devices by collaborating with four internationally renowned manufacturers. The results demonstrate that LaChouTi is effective for the manufacturers'security updates.

Applying deep learning based automatic bug triager to industrial projects

Finding the appropriate developer for a bug report, so called `Bug Triage', is one of the bottlenecks in the bug resolution process. To address this problem, many approaches have proposed various automatic bug triage techniques in recent studies. We argue that most previous studies focused on open source projects only and did not consider deep learning techniques. In this paper, we propose to use Convolutional Neural Network and word embedding to build an automatic bug triager. The results of the experiments applied to both industrial and open source projects reveal benefits of the automatic approach and suggest co-operation of human and automatic triagers. Our experience in integrating and operating the proposed system in an industrial development environment is also reported.

Static analysis for optimizing big data queries

Query languages for big data analysis provide user extensibility through a mechanism of user-defined operators (UDOs). These operators allow programmers to write proprietary functionalities on top of a relational query skeleton. However, achieving effective query optimization for such languages is extremely challenging since the optimizer needs to understand data dependencies induced by UDOs. SCOPE, the query language from Microsoft, allows for hand coded declarations of UDO data dependencies. Unfortunately, most programmers avoid using this facility since writing and maintaining the declarations is tedious and error-prone. In this work, we designed and implemented two sound and robust static analyses for computing UDO data dependencies. The analyses can detect what columns of an input table are never used or pass-through a UDO unchanged. This information can be used to significantly improve execution of SCOPE scripts. We evaluate our analyses on thousands of real-world queries and show we can catch many unused and pass-through columns automatically without relying on any manually provided declarations.

Automated testing of hybrid Simulink/Stateflow controllers: industrial case studies

We present the results of applying our approach for testing Simulink controllers to one public and one proprietary model, both industrial. Our approach combines explorative and exploitative search algorithms to visualize the controller behavior over its input space and to identify test scenarios in the controller input space that violate or are likely to violate the controller requirements. The engineers' feedback shows that our approach is easy to use in practice and gives them confidence about the behavior of their models.

SESSION: Dynamic Analysis

QEMU-based framework for non-intrusive virtual machine instrumentation and introspection

This paper presents the framework based on the emulator QEMU. Our framework provides set of multi-platform analysis tools for the virtual machines and mechanism for creating instrumentation and analysis tools. Our framework is based on a lightweight approach to dynamic analysis of binary code executed in virtual machines. This approach is non-intrusive and provides system-wide analysis capabilities. It does not require loading any guest agents and source code of the OS. Therefore it may be applied to ROM-based guest systems and enables using of record/replay of the system execution. We use application binary interface (ABI) of the platform to be analyzed for creating introspection tools. These tools recover the part of kernel-level information related to the system calls executed on the guest machine.

RunDroid: recovering execution call graphs for Android applications

Fault localization is a well-received technique for helping developers to identify faulty statements of a program. Research has shown that the coverages of faulty statements and its predecessors in program dependence graph are important for effective fault localization. However, app executions in Android split into segments in different components, i.e., methods, threads, and processes, posing challenges for traditional program dependence computation, and in turn rendering fault localization less effective. We present RunDroid, a tool for recovering the dynamic call graphs of app executions in Android, assisting existing tools for more precise program dependence computation. For each exectuion, RunDroid captures and recovers method calls from not only the application layer, but also between applications and the Android framework. Moreover, to deal with the widely adopted multi-threaded communications in Android applications, RunDroid also captures methods calls that are split among threads. Demo : Video :

RGSE: a regular property guided symbolic executor for Java

It is challenging to effectively check a regular property of a program. This paper presents RGSE, a regular property guided dynamic symbolic execution (DSE) engine, for finding a program path satisfying a regular property as soon as possible. The key idea is to evaluate the candidate branches based on the history and future information, and explore the branches along which the paths are more likely to satisfy the property in priority. We have applied RGSE to 16 real-world open source Java programs, totaling 270K lines of code. Compared with the state-of-the-art, RGSE achieves two orders of magnitude speedups for finding the first target path. RGSE can benefit many research topics of software testing and analysis, such as path-oriented test case generation, typestate bug finding, and performance tuning. The demo video is at:, and RGSE can be accessed at:

A tool for automated reasoning about traces based on configurable formal semantics

We present Tarski, a tool for specifying configurable trace semantics to facilitate automated reasoning about traces. Software development projects require that various types of traces be modeled between and within development artifacts. For any given artifact (e.g., requirements, architecture models and source code), Tarski allows the user to specify new trace types and their configurable semantics, while, using the semantics, it automatically infers new traces based on existing traces provided by the user, and checks the consistency of traces. It has been evaluated on three industrial case studies in the automotive domain (

VART: a tool for the automatic detection of regression faults

In this paper we present VART, a tool for automatically revealing regression faults missed by regression test suites. Interestingly, VART is not limited to faults causing crashing or exceptions, but can reveal faults that cause the violation of application-specific correctness properties. VART achieves this goal by combining static and dynamic program analysis.


DynAlloy analyzer: a tool for the specification and analysis of alloy models with dynamic behaviour

We describe DynAlloy Analyzer, a tool that extends Alloy Analyzer with support for dynamic elements in Alloy models. The tool builds upon Alloy Analyzer in a way that makes it fully compatible with Alloy models, and extends their syntax with a particular idiom, inspired in dynamic logic, for the description of dynamic behaviours, understood as sequences of states over standard Alloy models, in terms of programs. The syntax is broad enough to accommodate abstract dynamic behaviours, e.g., using nondeterministic choice and finite unbounded iteration, as well as more concrete ones, using standard sequential programming constructions. The analysis of DynAlloy models resorts to the analysis of Alloy models, through an optimized translation that often makes the analysis more efficient than that of typical ad-hoc constructions to capture dynamism in Alloy.

Tool screencast, binaries and further details available in:

From scenario modeling to scenario programming for reactive systems with dynamic topology

Software-intensive systems often consist of cooperating reactive components. In mobile and reconfigurable systems, their topology changes at run-time, which influences how the components must cooperate. The Scenario Modeling Language (SML) offers a formal approach for specifying the reactive behavior such systems that aligns with how humans conceive and communicate behavioral requirements. Simulation and formal checks can find specification flaws early. We present a framework for the Scenario-based Programming (SBP) that reflects the concepts of SML in Java and makes the scenario modeling approach available for programming. SBP code can also be generated from SML and extended with platform-specific code, thus streamlining the transition from design to implementation. As an example serves a car-to-x communication system. Demo video and artifact:

CLTSA: labelled transition system analyser with counting fluent support

In this paper we present CLTSA (Counting Fluents Labelled Transition System Analyser), an extension of LTSA (Labelled Transition System Analyser) that incorporates counting fluents, a useful mechanism to capture properties related to counting events. Counting fluent temporal logic is a formalism for specifying properties of event-based systems, which complements the notion of fluent by the related concept of counting fluent. While fluents allow us to capture boolean properties of the behaviour of a reactive system, counting fluents are numerical values, that enumerate event occurrences.

The tool supports a superset of FSP (Finite State Processes), that allows one to define LTL properties involving counting fluents, which can be model checked on FSP processes. Detailed information can be found at

The MONDO collaboration framework: secure collaborative modeling over existing version control systems

Model-based systems engineering of critical cyber-physical systems necessitates effective collaboration between different stakeholders while still providing secure protection of intellectual properties of all involved parties. While engineering artifacts are frequently stored in version control repositories, secure access control is limited to file-level strategies in most existing frameworks where models are split into multiple fragments with all-or-nothing permissions, which becomes a scalability and usability bottleneck in case of complex industrial models.

In this paper, we introduce the MONDO Collaboration Framework, which provides rule-based fine-grained model-level secure access control, property-based locking and automated model merge integrated over existing version control systems such as Subversion (SVN) for storage and version control. Our framework simultaneously supports offline collaboration (asynchronous checkout-modify-commit) on top of off-the-shelf modeling tools and online scenarios (GoogleDocs-style short transactions) scenarios by offering a web-based modeling frontend.

Screencast Demo:

Model-based privacy and security analysis with CARiSMA

We present CARiSMA, a tool that is originally designed to support model-based security analysis of IT systems. In our recent work, we added several new functionalities to CARiSMA to support the privacy of personal data. Moreover, we introduced a mechanism to assist the system designers to perform a CARiSMA analysis by automatically initializing an appropriate CARiSMA analysis concerning security and privacy requirements. The motivation for our work is Article 25 of Regulation (EU) 2016/679, which requires appropriate technical and organizational controls must be implemented for ensuring that, by default, the processing of personal data complies with the principles on processing of personal data. This implies that initially IT systems must be analyzed to verify if such principles are respected. System models allow the system developers to handle the complexity of systems and to focus on key aspects such as privacy and security. CARiSMA is available at and our screen cast at


Cherry-picking of code commits in long-running, multi-release software

This paper presents Tartarian, a tool that supports maintenance of software with long-running, multi-release branches in distributed version control systems. When new maintenance code, such as bug fixes and code improvement, is committed into a branch, it is likely that such code can be applied or reused with some other branches. To do so, a developer may manually identify a commit and cherry pick it. Tartarian can support this activity by providing commit hashtags, which the developer uses as metadata to specify their intentions when committing the code. With these tags, Tartarian uses dependency graph, that represents the dependency constraints of the branches, and Branch Identifier, which matches the commit hashtags with the dependency graph, to identify the applicable branches for the commits. Using Tartarian, developers may be able to maintain software with multiple releases more efficiently. A video demo of Tartarian is available at

ARCC: assistant for repetitive code comprehension

As software projects evolve, carefully understanding the behavior of a program is mandatory before making any change. Repetitive code snippets also tend to appear throughout the codebase, and developers have to understand similar semantics multiple times. Building on this observation, we present Arcc: an Assistant for Repetitive Code Comprehension. The tool, implemented as an Eclipse plugin, assists developers in leveraging knowledge of a program to understand other programs containing a subset of the semantics in the former. Arcc differs from existing approaches in that it uses an extensible knowledge base of recurrent semantic code snippets, instead of heuristics or salient features, to summarize the behavior of a program. Given a program, we detect the occurrences of such snippets. Developers can create strategies as combinations of the snippets found and search for strategy occurrences in their workspace. Arcc highlights the source code related to every snippet and their interleaving, assisting in getting an intuition of similar programs. Finally, Arcc underlines potential common errors associated with the snippets, assisting in detecting overlooked problems.

JoanAudit: a tool for auditing common injection vulnerabilities

JoanAudit is a static analysis tool to assist security auditors in auditing Web applications and Web services for common injection vulnerabilities during software development. It automatically identifies parts of the program code that are relevant for security and generates an HTML report to guide security auditors audit the source code in a scalable way. JoanAudit is configured with various security-sensitive input sources and sinks relevant to injection vulnerabilities and standard sanitization procedures that prevent these vulnerabilities. It can also automatically fix some cases of vulnerabilities in source code — cases where inputs are directly used in sinks without any form of sanitization — by using standard sanitization procedures. Our evaluation shows that by using JoanAudit, security auditors are required to inspect only 1% of the total code for auditing common injection vulnerabilities. The screen-cast demo is available at

XSearch: a domain-specific cross-language relevant question retrieval tool

During software development process, Chinese developers often seek solutions to the technical problems they encounter by searching relevant questions on Q&A sites. When developers fail to find solutions on Q&A sites in Chinese, they could translate their query and search on the English Q&A sites. However, Chinese developers who are non-native English speakers often are not comfortable to ask or search questions in English, as they do not know the proper translation of the Chinese technical words into the English technical words. Furthermore, the process of manually formulating cross-language queries and determining the importance of query words is a tedious and time-consuming process. For the purpose of helping Chinese developers take advantages of the rich knowledge base of the English version of Stack Overflow and simplify the retrieval process, we propose an automated cross-language relevant question retrieval tool (XSearch) to retrieve relevant English questions on Stack Overflow for a given Chinese question. This tool can address the increasing need for developer to solve technical problems by retrieving cross-language relevant Q&A resources. Demo Tool Website: Demo Video:

SESSION: Doctoral Symposium

Using search-based software engineering to handle the changes with uncertainties for self-adaptive systems

The changes confronting contemporary Self-Adaptive Systems (SASs) are characterized by uncertainties in their relationships, priorities, and contexts. To generate adaptation strategies for handling these changes, existing adaptation planning methods, which ignore these uncertainties, must be improved. This thesis explores the possibilities of using Search-Based Software Engineering (SBSE) to establish a search-based planning method capable of handling multiple changes in an uncertain context without defining their priorities. Meanwhile, both the assurance approach to improving the efficiency of adaptation planning and the selection approach to choosing a unique strategy are proposed to solve emerging research questions that arise when such planning method is applied in actual SASs. From this experience, we are able to derive innovative methods for the designers of SASs as a reference, which may observably improve the ability of SASs and promote the widespread use of SBSE in SASs.

DRACO: discovering refactorings that improve architecture using fine-grained co-change dependencies

Co-change dependencies arise whenever two source code entities, such as classes, methods, or fields, change frequently together. Similar to other kinds of software dependencies, it is possible to build software clusters from co-change relationships and, as such, previous studies explored the use of this kind of dependency in several software engineering tasks, such as predicting software faults, recommending related source code changes, and assessing software modularity. In this ongoing work, our goal is to provide tools to discover refactoring opportunities-such as move method, move field, split class, or merge classes-that are revealed when comparing the co-change clusters of fine-grained source code entities (methods, fields, constructors) to the original class decomposition; specifically when a source code entity is in the same class but in different clusters (or vice-versa). Our approach, named Draco, aims to produce minimal refactoring sequences that improve architecture decomposition.

User- and analysis-driven context aware software development in mobile computing

Mobile applications may benefit from context awareness since they incur to context changes during their execution and their success depends on the user perceived quality. Context awareness requires context monitoring and system adaptation, these two tasks are very expensive especially in mobile applications. Our research aims at developing a methodology that enables effective context awareness techniques for mobile applications that allows adaptations of the mobile app to context changes so that the desired system quality properties and user satisfaction is maximized. Here effective means selecting a minimum set of context variables to monitor and a minimum set of adaptive tactics to inject into mobile applications that allows to guarantee the required software quality and to maximize the user satisfaction. In this paper, we show the devised methodology on a motivating example, detailing the ongoing work.

Recommender system for model driven software development

Models are key artifacts in model driven software engineering, similar to source code in traditional software engineering. Integrated development environments help users while writing source code, e.g. with typed auto completions, quick fixes, or automatic refactorings. Similar integrated features are rare for modeling IDEs. The above source code IDE features can be seen as a recommender system.

A recommender system for model driven software engineering can combine data from different sources in order to infer a list of relevant and actionable model changes in real time. These recommendations can speed up working on models by automating repetitive tasks and preventing errors when the changes are atypical for the changed models.

Recommendations can be based on common model transformations that are taken from the literature or learned from models in version control systems. Further information can be taken from instance- to meta-model relationships, modeling related artifacts (e.g. correctness constraints), and versions histories of models under version control.

We created a prototype recommender that analyses the change history of a single model. We computed its accuracy via cross-validation and found that it was between 0.43 and 0.82 for models from an open source project.

In order to have a bigger data set for the evaluation and the learning of model transformation, we also mined repositories from Eclipse projects for Ecore meta models and their versions. We found 4374 meta models with 17249 versions. 244 of these meta models were changed at least ten times and are candidates for learning common model transformations.

We plan to evaluate our recommender system in two ways: (1) In off-line evaluations with data sets of models from the literature, created by us, or taken from industry partners. (2) In on-line user studies with participants from academia and industry, performed as case studies and controlled experiments.

On the similarity of software development documentation

Software developers spent 20% of their time on information seeking on Stack Overflow, YouTube or an API reference documentation. Software developers can search within Stack Overflow for duplicates or similar posts. They can also take a look on software development documentations that have similar and additional information included as a Stack Overflow post or a development screencast in order to get new inspirations on how to solve their current development problem. The linkage of same and different types of software development documentation might safe time to evolve new software solutions and might increase the productivity of the developer’s work day. In this paper we will discuss our approach to get a broader understanding of different similarity types (exact, similar and maybe) within and between software documentation as well as an understanding of how different software documentations can be extended.

Application of search-based software engineering methodologies for test suite optimization and evolution in mission critical mobile application development

The demand for high quality mobile applications is constantly rising, especially in mission critical settings. Thus, new software engineering methodologies are needed in order to ensure the desired quality of an application. The research presented proposes a quality assurance methodology for mobile applications through test automation by optimizing test suites. The desired goal is to find a minimal test suite while maintaining efficiency and reducing execution cost. Furthermore to avoid invalidating an optimized test suite as the system under test evolves, the approach further proposes to extract patterns from the applied changes to an application. The evaluation plan comprises a combination of an empirical and an industrial case study based on open source projects and an industrial project in the healthcare domain. It is expected that the presented approach supports the testing process on mobile application platforms.

Summarizing software engineering communication artifacts from different sources

During software development, developers communicate a lot and with many different people. Communication is an important factor, to the point that communication failures are seen as the causes of productivity losses or even project failures. To communicate with each other, software developers use many different tools like mailing lists, forums, issue trackers or chats. Even in a short time span, a lot of information artifacts can arise through these channels, which can be very time consuming to get through after a long vacation or for new members of the team. This paper describes a research plan for an approach which can summarize different communication sources into one big summary using and improving existing text summarization approaches. The resulting tool would have the potential to decrease the effort needed for sense-making and comprehension of communication, as well as the time needed for locating and using information from the communication sources. This reduction in effort will result in a significant increase in the productivity of software development companies.

Model-based dynamic software project scheduling

Software project scheduling, under uncertain and dynamic environments, is one of the most important challenges in software engineering. Recent studies addressed this challenge in both static and dynamic scenarios for small and medium size software projects. The increasing trend of cloud based software solutions (large scale software projects) needs agility not only for sustainable maintenance but also for in time and within budget completion. Therefore, this paper formulates software project scheduling problem (SPSP) as an optimization problem under uncertainties and dynamics for hybrid scRUmP software model. In this regard, a mathematical model is constructed with five objectives as project duration, task fragmentation, robustness, cost, and stability.

System performance optimization via design and configuration space exploration

The runtime performance of a software system often depends on a large number of static parameters, which usually interact in complex ways to carry out system functionality and influence system performance. It's hard to understand such configuration spaces and find good combinations of parameter values to gain available levels of performance. Engineers in practice often just accept the default settings, leading such systems to significantly underperform relative to their potential. This problem, in turn, has impacts on cost, revenue, customer satisfaction, business reputation, and mission effectiveness. To improve the overall performance of the end-to-end systems, we propose to systematically explore (i) how to design new systems towards good performance through design space synthesis and evaluation, and (ii) how to auto-configure an existing system to obtain better performance through heuristic configuration space search. In addition, this research further studies execution traces of a system to predict runtime performance under new configurations.

SESSION: Student Research Competition

Suggesting meaningful variable names for decompiled code: a machine translation approach

Decompiled code lacks meaningful variable names. We used statistical machine translation to suggest variable names that are natural given the context. This technique has previously been successfully applied to obfuscated JavaScript code, but decompiled C code poses unique challenges in constructing an aligned corpus and selecting the best translation from among several candidates.

Practical symbolic verification of regular properties

It is challenging to verify regular properties of programs. This paper presents symbolic regular verification (SRV), a dynamic symbolic execution based technique for verifying regular properties. The key technique of SRV is a novel synergistic combination of property-oriented path slicing and guiding to mitigate the path explosion problem. Indeed, slicing can prune redundant paths, while guiding can boost the finding of counterexamples. We have implemented SRV for Java and evaluated it on 16 real-world open-source Java programs (totaling 270K lines of code). The experimental results demonstrate the effectiveness and efficiency of SRV.

FOSS version differentiation as a benchmark for static analysis security testing tools

We propose a novel methodology that allows automatic construction of benchmarks for Static Analysis Security Testing (SAST) tools based on real-world software projects by differencing vulnerable and fixed versions in FOSS repositories. The methodology allows us to evaluate ``actual'' performance of SAST tools (without unrelated alarms). To test our approach, we benchmarked 7 SAST tools (although we report only results for the two best tools), against 70 revisions of four major versions of Apache Tomcat with 62 distinct CVEs as the source of ground truth vulnerabilities.

DecisionDroid: a supervised learning-based system to identify cloned Android applications

This study presents DecisionDroid, a supervised learning based system to identify cloned Android app pairs. DecisionDroid is trained using a manually verified diverse dataset of 12,000 Android app pairs. On a hundred ten-fold cross validations, DecisionDroid achieved 97.9% precision, 98.3% recall, and 98.4% accuracy.

Reasons and drawbacks of using trivial npm packages: the developers' perspective

Code reuse is traditionally seen as good practice. Recent trends have pushed the idea of code reuse to an extreme, by using packages that implement simple and trivial tasks, which we call ‘trivial packages’. A recent incident where a trivial package led to the breakdown of some of the most popular web applications such as Facebook and Netflix, put the spotlight on whether using trivial packages should be encouraged. Therefore, in this research, we mine more than 230,000 npm packages and 38,000 JavaScript projects in order to study the prevalence of trivial packages. We found that trivial packages are common, making up 16.8% of the studied npm packages. We performed a survey with 88 Node.js developers who use trivial packages to understand the reasons for and drawbacks of their use. We found that trivial packages are used because they are perceived to be well-implemented and tested pieces of code. However, developers are concerned about maintaining and the risks of breakages due to the extra dependencies trivial packages introduce.

Detecting wearable app permission mismatches: a case study on Android wear

Wearable devices are becoming increasingly popular. These wearable devices run what is known as wearable apps. Wearable apps are packaged with handheld apps, that must be installed on the accompanying handheld device (e.g., phone).

Given that wearable apps are tightly coupled with the handheld apps, any wearable permission must also be requested in the handheld version of the app on the Android Wear platform. However, in some cases, the wearable apps may request permissions that do not exist in the handheld app, resulting in a permission mismatch, and causing the wearable app to error or crash. In this paper, we propose a technique to detect wear app permission mismatches. We perform a case study on 2,409 free Android Wear apps and find that 73 released wearable apps suffer from the permission mismatch problem.

Automating traceability link recovery through classification

Traceability Link Recovery (TLR) is an important software engineering task in which a stakeholder establishes links between related items in two sets of software artifacts. Most existing approaches leverage Information Retrieval (IR) techniques, and formulate the TLR task as a retrieval problem, where pairs of similar artifacts are retrieved and presented to a user. These approaches still require significant human effort, as a stakeholder needs to manually inspect the list of recommendations and decide which ones are true links and which ones are false. In this work, we aim to automate TLR by re-imagining it as a binary classification problem. More specifically, our machine learning classification approach is able to automatically classify each link in the set of all potential links as either valid or invalid, therefore circumventing the substantial human effort required by existing techniques.

Improving performance of automatic program repair using learned heuristics

Automatic program repair offers the promise of significant reduction in debugging time, but still faces challenges in making the process efficient, accurate, and generalizable enough for practical application. Recent efforts such as Prophet demonstrate that machine learning can be used to develop heuristics about which patches are likely to be correct, reducing overfitting problems and improving speed of repair. SearchRepair takes a different approach to accuracy, using blocks of human-written code as patches to better constrain repairs and avoid overfitting. This project combines Prophet's learning techniques with SearchRepair's larger block size to create a method that is both fast and accurate, leading to higher-quality repairs. We propose a novel first-pass filter to substantially reduce the number of candidate patches in SearchRepair and demonstrate 85% reduction in runtime over standard SearchRepair on the IntroClass dataset.