Mutation-based fault localization (MBFL) is a promising direction toward improving fault localization accuracy by leveraging dynamic information extracted by mutation testing on a target program. One issue in investigating MBFL techniques is that experimental evaluations are prone to various validity threats because there are many factors to control in generating and running mutants for fault localization. To understand the different validity threats in experimenting with MBFL techniques, this paper reports our reproduction of the MBFL assessments with Defects4J originally studied by Pearson et al. Taking the JFreeChart artifacts of Defects4J (26 bug cases in total) as study objects, we conducted the same empirical evaluation on two MBFL techniques (Metallaxis and MUSE) and six SBFL techniques (DStar, Op2, Ochiai, Jaccard, Barinel, Tarantula) while identifying and managing validity threats in alternative ways. We found that the evaluation results for the studied techniques differ from the original results in many respects; thus, the identified validity threats should be managed carefully when designing and conducting empirical assessments of MBFL techniques.
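For readers unfamiliar with the SBFL baselines named above, the following minimal sketch shows how five of them score a single program element from its coverage spectrum. The formulas are the commonly cited definitions and are not taken from this abstract; the function name `suspiciousness`, its parameter names, and the zero-denominator conventions are assumptions made for illustration. This is not the evaluation setup of the study.

```python
import math


def suspiciousness(ef, ep, nf, np_, formula="ochiai", star=2):
    """Score one program element from its coverage spectrum.

    ef / ep : failing / passing tests that execute the element
    nf / np_: failing / passing tests that do not execute it
    Zero denominators map to 0.0 (or infinity for DStar when ef > 0);
    that convention is an assumption of this sketch.
    """
    total_failed = ef + nf
    total_passed = ep + np_

    if formula == "ochiai":
        denom = math.sqrt(total_failed * (ef + ep))
        return ef / denom if denom else 0.0
    if formula == "tarantula":
        fail_ratio = ef / total_failed if total_failed else 0.0
        pass_ratio = ep / total_passed if total_passed else 0.0
        total = fail_ratio + pass_ratio
        return fail_ratio / total if total else 0.0
    if formula == "jaccard":
        denom = total_failed + ep
        return ef / denom if denom else 0.0
    if formula == "op2":
        return ef - ep / (total_passed + 1)
    if formula == "dstar":
        denom = ep + nf
        if denom == 0:
            return float("inf") if ef else 0.0
        return (ef ** star) / denom
    raise ValueError(f"unknown formula: {formula}")


# Hypothetical spectrum: one failing and one passing test cover the element,
# one failing and one passing test do not.
for name in ("ochiai", "tarantula", "jaccard", "op2", "dstar"):
    print(name, suspiciousness(1, 1, 1, 1, name))
```

Elements are then ranked by descending score; the techniques differ only in how they weigh failing against passing coverage.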
Privacy is gaining importance in modern systems. As a matter of fact, personal data are out of the control of the original owner and remain in the hands of the software-systems producers. In this new ideas paper, we drastically change the nature of data from passive to active as a way to empower the user and preserve both the original ownership of the data and the privacy policies specified by the data owner. We demonstrate the idea of active data in the mobile domain.
Threat modeling involves the systematic identification and analysis of security threats in the context of a specific system. This paper starts from an assessment of its current state of practice, based on interactions with threat modeling professionals. We argue that threat modeling is still at a low level of maturity and identify the main criteria for successful adoption in practice. Furthermore, we identify a set of key research challenges for aligning threat modeling research to industry practice, thereby raising the technology-readiness levels of the ensuing solutions, approaches, and tools.
Despite the unarguable importance of Stack Overflow (SO) for the daily work of many software developers and despite existing knowledge about the impact of code duplication on software maintainability, the prevalence and implications of code clones on SO have not yet received the attention they deserve. In this paper, we motivate why studies on code duplication within SO are needed and how existing studies on code reuse differ from this new research direction. We present similarities and differences between code clones in general and code clones on SO and point to open questions that need to be addressed to be able to make data-informed decisions about how to properly handle clones on this important platform. We present results from a first preliminary investigation, indicating that clones on SO are common and diverse. We further point to specific challenges, including incentives for users to clone successful answers and difficulties with bulk edits on the platform, and conclude with possible directions for future work.
Bug severity is an important factor in prioritizing which bugs to fix first. The process of triaging bug reports and assigning a severity requires developer expertise and knowledge of the underlying software. Methods to automate the assignment of bug severity have been developed to reduce the developer cost; however, many of these methods require 70-90% of the project's bug reports as training data, which delays their use until later in the development process. Not being able to automatically predict a bug report's severity early in a project can greatly reduce the benefits of automation. We have developed a new bug report severity prediction method that leverages how bug reports are written rather than what the bug reports contain. Our method allows for the prediction of bug severity at the beginning of the project by using an organization's historical data, in the form of bug reports from past projects, to train the prediction classifier. In validating our approach, we conducted over 1000 experiments on a dataset of five NASA robotic mission software projects. Our results demonstrate that our method was not only able to predict the severity of bugs earlier in development, but it was also able to outperform an existing keyword-based classifier for a majority of the NASA projects.
Programmers should write code comments, but not on every line of code. We have created a machine learning model that suggests locations where a programmer should write a code comment. We trained it on existing commented code to learn locations that are chosen by developers. Once trained, the model can predict locations in new code. Our models achieved precision of 74% and recall of 13% in identifying comment-worthy locations. This first success opens the door to future work, both in the new where-to-comment problem and in guiding comment generation. Our code and data are available at http://groups.inf.ed.ac.uk/cup/comment-locator/.
The surprising predictability of source code has triggered a boom in tools using language models for code. Code is much more predictable than natural language, but the reasons are not well understood. We propose a dual channel view of code; code combines a formal channel for specifying execution and a natural language channel in the form of identifiers and comments that assists human comprehension. Computers ignore the natural language channel, but developers read both and, when writing code for long-term use and maintenance, consider each channel's audience: computer and human. As developers hold both channels in mind when coding, we posit that the two channels interact and constrain each other; we call these dual channel constraints. Their impact has been neglected. We describe how they can lead to humans writing code in a way more predictable than natural language, highlight pioneering research that has implicitly or explicitly used parts of this theory, and drive new research, such as systematically searching for cross-channel inconsistencies. Dual channel constraints provide an exciting opportunity as truly multi-disciplinary research; for computer scientists they promise improvements to program analysis via a more holistic approach to code, and to psycholinguists they promise a novel environment for studying linguistic processes.
Quantum computers are becoming more mainstream. As more programmers are starting to look at writing quantum programs, they face an inevitable task of debugging their code. How should the programs for quantum computers be debugged?
In this paper, we discuss existing debugging tactics used in developing programs for classical computers and show which ones can be readily adopted. We also highlight quantum-computer-specific debugging issues and list novel techniques that are needed to address these issues. Practitioners can readily apply some of these tactics to their process of writing quantum programs, while researchers can learn about opportunities for future work.
Reachable sets are critical for path planning and navigation of mobile autonomous systems. Traditionally, these sets are computed using system models instantiated with their physical bounds. This exclusive focus on the physical bounds belies the fact that these systems are increasingly driven by sophisticated software components that can also bound the variables in the system models. This work explores the degree to which bounds manifested in the software can affect the computation of reachable sets, introduces an analysis approach to discover such bounds in code, and illustrates the potential of that approach on two systems. The preliminary results reveal that taking into consideration software bounds can reduce traditionally computed reachable sets by up to 91%.
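To make the effect of a software bound concrete, here is a deliberately simplified sketch: a one-dimensional point whose speed is limited physically, and additionally clamped by a (hypothetical) constant found in its control code. The function names, the constants, and the interval model are all assumptions for illustration; they are not the analysis approach or the systems studied in the paper, and the resulting reduction figure is unrelated to the paper's reported results.

```python
def reachable_interval(x0, max_speed, horizon):
    """Positions reachable within `horizon` seconds by a 1-D point
    starting at x0 and moving with |velocity| <= max_speed."""
    if max_speed < 0 or horizon < 0:
        raise ValueError("max_speed and horizon must be non-negative")
    radius = max_speed * horizon
    return (x0 - radius, x0 + radius)


def length(interval):
    return interval[1] - interval[0]


def relative_reduction(physical, software):
    """Fraction by which the software-aware set is smaller."""
    return 1.0 - length(software) / length(physical)


# Hypothetical values, chosen only for illustration.
PHYSICAL_MAX_SPEED = 10.0   # m/s, from the physical model
SOFTWARE_SPEED_CAP = 2.0    # m/s, e.g. a clamp applied in controller code
HORIZON = 5.0               # seconds

physical_set = reachable_interval(0.0, PHYSICAL_MAX_SPEED, HORIZON)
software_set = reachable_interval(
    0.0, min(PHYSICAL_MAX_SPEED, SOFTWARE_SPEED_CAP), HORIZON
)

print(physical_set, software_set, relative_reduction(physical_set, software_set))
```

The point of the sketch is only that the tighter of the two bounds determines the set; real reachability analyses operate on multi-dimensional dynamics.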
Agile software development welcomes requirements changes even late in the process. However, it is unclear how agile teams respond to these changes. Therefore, as the starting point of a planned extensive study on handling requirements changes in agile projects, we examined the emotional responses to requirements changes by agile teams. In this paper, we (i) introduce a novel combined approach of using Grounded Theory and Sentiment Analysis (through SentiStrength and Job Emotion Scale), (ii) present three distinct phases of emotional responses, a summary of emotions, and variation in emotions and sentiment polarity (positivity, negativity, and neutrality) at each stage, and (iii) emphasize the necessity of taking emotional responses of agile teams into consideration when applying agile principles and practices to deal with changes.
The Internet of Things (IoT) can be seen as a variety of interconnected things (e.g., RFID tags, sensors, actuators, mobile phones) that provide a certain service to final users. The interaction between these things and the user in these environments is challenging due to interoperability, consistency of interactions, ease of installation, and so on. Hence, there is a need to provide tools for User Experience (UX) evaluation in IoT scenarios and, in this work, we propose CHASE, a checklist to facilitate UX evaluation in IoT scenarios. First, CHASE was constructed based on the results of a literature review that identified methods, instruments, and characteristics of human-thing interaction and evaluation in these scenarios. Then, fourteen IoT experts reviewed a preliminary version of the checklist, followed by three Human-Computer Interaction (HCI) experts. Finally, three researchers used CHASE to evaluate the user experience in a real IoT environment to collect authentic feedback. This paper also presents a series of important points for the evaluation of UX in IoT environments based on the literature review, the opinion of the experts, and the IoT application evaluations.
Closing a question on a community question answering forum such as Stack Overflow is a highly divisive event. On one hand, moderation is of crucial importance in maintaining the content quality indispensable for the future sustainability of the site. On the other hand, details about the closing reason might frequently appear blurred to the user, which leads to debates and occasional negative behavior in answers or comments. With the aim of helping the users compose good quality questions, we introduce a set of classifiers for the categorization of Stack Overflow posts prior to their actual submission. Our binary classifier is capable of predicting whether a question will be closed after posting with an accuracy of 71.87%. Additionally, in this study we propose the first multiclass classifier to estimate the exact reason of closing a question to an accuracy of 48.55%. Both classifiers are based on Gated Recurrent Units and trained solely on the pre-submission textual information of Stack Overflow posts.
Developers are increasingly sharing images in social coding environments alongside the growth in visual interactions within social networks. An analysis of the ratio between textual and visual content in Mozilla's change requests and in StackOverflow programming Q/As revealed a steady increase in sharing images over the past five years. Developers' shared images are meaningful and provide complementary information compared to their associated text. Often, the shared images are essential in understanding the change requests, questions, or the responses submitted. Relying on these observations, we delve into the potential of automatic completion of textual software artifacts with visual content.
By bringing together code, text, and examples, Jupyter notebooks have become one of the most popular means to produce scientific results in a productive and reproducible way. As many of the notebook authors are experts in their scientific fields, but laymen with respect to software engineering, one may ask questions on the quality of notebooks and their code. In a preliminary study, we experimentally demonstrate that Jupyter notebooks are inundated with poor quality code, e.g., not respecting recommended coding practices, or containing unused variables and deprecated functions. Considering the educational nature of Jupyter notebooks, these poor coding practices, as well as the lack of quality control, might be propagated into the next generation of developers. Hence, we argue that there is a strong need to programmatically analyze Jupyter notebooks, calling on our community to pay more attention to the reliability of Jupyter notebooks.
Developers from open-source communities have reported high stress levels from frequent demands for features and bug fixes and from the sometimes aggressive tone of these demands. Toxic conversations may demotivate and burn out developers, creating challenges for sustaining open source. We outline a path toward finding, understanding, and possibly mitigating such unhealthy interactions. We take a first step toward finding them, by developing and demonstrating a measurement instrument (an SVM classifier tailored for software engineering) to detect toxic discussions in GitHub issues. We used our classifier to analyze trends over time and in different GitHub communities, finding that toxicity varies by community and that toxicity decreased between 2012 and 2018.
As software is rapidly being embedded into major parts of our society, ranging from medical devices and self-driving vehicles to critical infrastructures, potential risks of software failures are also growing at an alarming pace. Existing certification processes, however, suffer from a lack of rigor and automation, and often incur a significant amount of manual effort on both system developers and certifiers. To address this issue, we propose a substantially automated, cost-effective certification method, backed with a novel analysis synthesis technique to automatically generate application-specific analysis tools that are custom-tailored to producing the necessary evidence. The outcome of this research promises to not only assist software developers in producing safer and more reliable software, but also benefit industrial certification agencies by significantly reducing the manual effort of certifiers. Early validation flows from experience applying this approach in constructing an assurance case for a surgical robot system in collaboration with the Center for the Advanced Surgical Technology.
Programs typically provide a broad range of features. Because different typologies of users tend to use only a subset of these features, and unnecessary features can harm performance and security, program debloating techniques, which can reduce the size of a program by eliminating (possibly) unneeded features, are becoming increasingly popular. Most existing debloating techniques tend to focus on program-size reduction alone and, although effective, ignore other important aspects of debloating. We believe that program debloating is a multifaceted problem that must be addressed in a more general way. In this spirit, we propose a general approach that allows for formulating program debloating as a multi-objective optimization problem. Given a program to be debloated, our approach lets users specify (1) a usage profile for the program (i.e., a set of inputs with associated usage probabilities), (2) the factors of interest for debloating, and (3) the relative importance of these factors. Based on this information, the approach defines a suitable objective function for associating a score with every possible reduced program and aims to generate an optimal solution that maximizes the objective function. We also present and evaluate Debop, a specific instance of our approach that considers three objectives: size reduction, attack-surface reduction, and generality (i.e., the extent to which the reduced program handles inputs in the usage profile provided). Our results, albeit still preliminary, are promising and show that our approach can be effective at generating debloated programs that achieve a good trade-off between the different debloating objectives considered. Our results also provide insights on the performance of our general approach when compared to a specialized single-goal technique.
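One simple way to picture an objective function over the three factors named above is a weighted sum, sketched below. The weighted-sum form, the `Candidate` type, the helper names, and all numbers are assumptions made for illustration; the abstract does not state how Debop actually combines its objectives or searches the space of reduced programs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical reduced program and its measured properties."""
    name: str
    size_reduction: float        # fraction in [0, 1]
    attack_reduction: float      # fraction in [0, 1]
    handled_inputs: frozenset    # profile inputs it still handles


def generality(candidate, usage_profile):
    """Total usage probability of the inputs the candidate still handles."""
    return sum(p for inp, p in usage_profile.items()
               if inp in candidate.handled_inputs)


def score(candidate, usage_profile, weights):
    """Weighted sum of the three factors (an assumed objective form)."""
    w_size, w_attack, w_gen = weights
    return (w_size * candidate.size_reduction
            + w_attack * candidate.attack_reduction
            + w_gen * generality(candidate, usage_profile))


def best_candidate(candidates, usage_profile, weights):
    return max(candidates, key=lambda c: score(c, usage_profile, weights))


# Hypothetical usage profile: input -> usage probability.
usage_profile = {"--help": 0.1, "-c": 0.6, "-x": 0.3}
aggressive = Candidate("aggressive", 0.8, 0.7, frozenset({"-c"}))
conservative = Candidate("conservative", 0.3, 0.2,
                         frozenset({"--help", "-c", "-x"}))
candidates = [aggressive, conservative]

print(best_candidate(candidates, usage_profile, (0.25, 0.25, 0.5)).name)
print(best_candidate(candidates, usage_profile, (0.1, 0.1, 0.8)).name)
```

Shifting weight toward generality flips the choice from the aggressive to the conservative candidate, which is the kind of trade-off the relative-importance setting is meant to expose.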
Intermittent test failures (test flakiness) are common during continuous integration as modern software systems have become inherently non-deterministic. Understanding the root cause of test flakiness is crucial as intermittent test failures might be the result of real non-deterministic defects in the production code, rather than mere errors in the test code. Given a flaky test, existing techniques for root causing test flakiness compare the runtime behavior of its passing and failing executions. They achieve this by repetitively executing the flaky test on an instrumented version of the system under test. This approach has two fundamental limitations: (i) code instrumentation might prevent the manifestation of test flakiness; (ii) when test flakiness is rare, passively re-executing a test many times might be inadequate to trigger intermittent test outcomes. To address these limitations, we propose a new idea for root causing test flakiness that actively explores the non-deterministic space without instrumenting code. Our novel idea is to repetitively execute a flaky test under different execution clusters. Each cluster explores a certain non-deterministic dimension (e.g., concurrency, I/O, and networking) with dedicated software containers and fuzzy-driven resource load generators. The execution cluster that manifests the most balanced (or unbalanced) sets of passing and failing executions is likely to explain the broad type of test flakiness.
Application Programming Interfaces (APIs) grant developers access to the functionalities of code libraries. Due to missing knowledge of how an API is correctly used, developers can unintentionally misuse APIs, and thus introduce bugs. To tackle this issue, recent techniques aim to automatically infer specifications for correct API usage and detect misuses. Unfortunately, these techniques suffer from high false-positive rates, leading to many false alarms. While we believe that existing techniques will improve in the future, in this paper, we propose to investigate a different route: We assume that a developer manually detected and fixed an API misuse relating to a third-party library. Based on the change, we can infer a correction rule for the API misuse. Then, we can use this correction rule to detect the same and similar API misuses in the same or other projects. This represents a cooperative technique to transfer the knowledge of API-misuse fixes to other developers. We report promising insights on an implementation and empirical evidence on the applicability of our technique based on 43 real-world API misuses.
The search space explosion problem is a long-standing challenge for search-based automated program repair (APR). The operation space, which defines how to select appropriate mutation operators, and the ingredient space, which defines how to select appropriate code elements as fixing ingredients, are two major factors that determine the search space. Conventional approaches mainly devise fixing strategies via learning from frequent fixing patterns based on substantial patches collected from open-source projects. In this paper, we propose a new direction for search-based APR, that is, to repair a bug via learning from how the bug was introduced instead of learning from how other bugs are frequently fixed. Our empirical study reveals that substantial mutation operators and fixing ingredients required to fix a bug can be inferred from the commit that introduced the bug. Based on the findings of our empirical study, we devised a preliminary fixing strategy based on bug-inducing commits, which is able to repair 8 new bugs that cannot be repaired by the state-of-the-art techniques. Such results demonstrate that our proposed new idea for search-based APR is promising.
We look at the problem of deciding correctness of a programming assignment submitted by a student, with respect to a reference implementation provided by the teacher, the correctness property being output equivalence of the two programs. Typically, programming assignments are evaluated against a set of test-inputs. This checks for output equivalence, but is limited to the cases that have been tested for. One may compose the programs sequentially and assert at the end that the outputs must match. But verifying such programs is not easy; the proofs often require that the functionality of every component program be captured fully, making invariant inference a challenge. One may discover mismatches (i.e., bugs) using a bounded model checker, but their absence brings us back to the question of verification. In this paper, we show how a hypersafety verification technique can effectively be used for verifying correctness of programming assignments. This connects two seemingly unrelated problems, and opens up the possibility of employing tools and techniques being developed for the former to efficiently address the latter. We demonstrate the practicability of this approach by using a hypersafety verification tool named weaver and several sample assignment problems.
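The sequential composition mentioned above, and the gap between testing and verification, can be illustrated with a small sketch. Everything here (the `max3` assignment, the function names, the input sets) is a hypothetical example; it shows test-based checking and a bounded exhaustive search only, and is not the hypersafety verification technique or the weaver tool described in the abstract.

```python
from itertools import product


def reference_max3(a, b, c):
    """Teacher's reference implementation."""
    return max(a, b, c)


def student_max3(a, b, c):
    """Hypothetical submission with a bug: ties between a and b fall through."""
    if a > b and a > c:
        return a
    if b > a and b > c:
        return b
    return c


def outputs_match(reference, submission, args):
    """Run both programs in sequence on the same input and compare outputs."""
    expected = reference(*args)
    actual = submission(*args)
    return expected == actual


def find_mismatch(reference, submission, inputs):
    """Return the first input on which the outputs differ, or None."""
    for args in inputs:
        if not outputs_match(reference, submission, args):
            return args
    return None


# A hand-picked test set with distinct values misses the bug ...
hand_picked = [(1, 2, 3), (3, 2, 1), (2, 3, 1)]
print(find_mismatch(reference_max3, student_max3, hand_picked))

# ... while a bounded exhaustive search over {0, 1}^3 exposes it.
print(find_mismatch(reference_max3, student_max3, product(range(2), repeat=3)))
```

When the bounded search returns `None`, nothing is learned about inputs outside the bound, which is the verification question the abstract returns to.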
Model-driven software engineering fosters abstraction through the use of models and then automation by transforming them into various artefacts, in particular code, for example: 1) from architectural models to code, 2) from metamodels to API code (with EMF in Eclipse), 3) from entity models to front-end and back-end code in Web stack applications (with JHipster), etc. In all these examples, the generated code is usually enriched by developers with additional code implementing advanced functionalities (e.g., checkers, recommenders, etc.) to build a full coherent system. When the system must evolve, so must the models, from which the code is then re-generated. As a result, the developers' enriched code may be impacted and thus needs to co-evolve accordingly. Many approaches support the co-evolution of various artifacts, but not the co-evolution of code. This paper sheds light on this issue and aims to fill this gap.
We formulate the hypothesis that code co-evolution can be driven by the model changes by means of change propagation. To investigate this hypothesis, we implemented a prototype for the case of metamodels and their accompanying code in EMF Eclipse. As a preliminary evaluation, we considered the case of the OCL Pivot metamodel evolution and its code co-evolution in two projects from version 3.2.2 to 3.4.4. Preliminary results confirm our hypothesis that model-driven evolution changes can effectively drive the code co-evolution. On 562 impacts in two projects' code caused by 221 metamodel changes, our approach reached an average precision of 89% and an average recall of 92.5%.
Over the last decades, the Free/Libre/Open Source Software (FLOSS) phenomenon has been a topic of study and a source of real-life artifacts for software engineering research. A FLOSS project usually has a community around its project, organically producing informative resources to describe how, when, and why a particular change occurred in the source code or the development flow. Therefore, when studying this kind of project, collecting and analyzing texts and artifacts can promote a more comprehensive understanding of the phenomenon and the variety of organizational settings. However, despite the importance of examining Grey Literature (GL), such as technical reports, white papers, magazines, and blog posts for studying FLOSS projects, the GL Review is still an emerging technique in software engineering studies, lacking a well-established investigative methodology. To mitigate this gap, we present and discuss challenges and adaptations for the planning and execution of GL reviews in the FLOSS scenario. We provide a set of guidelines and lessons learned for further research, using, as an example, a review we are conducting on the Linux kernel development model.
The scale of manually validated data is currently limited by the effort that small groups of researchers can invest in the curation of such data. In this paper, we propose the use of registered reports to scale the curation of manually validated data. The idea is inspired by the Mechanical Turk and replaces monetary payment with authorship of the data set publication.
The increasing use of machine-learning (ML) enabled systems in critical tasks fuels the quest for novel verification and validation techniques that are nonetheless grounded in accepted system assurance principles. In traditional system development, model-based techniques have been widely adopted, where the central premise is that abstract models of the required system provide a sound basis for judging its implementation. We posit an analogous approach for ML systems using an ML technique that extracts, from the high-dimensional training data implicitly describing the required system, a low-dimensional underlying structure: a manifold. It is then harnessed for a range of quality assurance tasks such as test adequacy measurement, test input generation, and runtime monitoring of the target ML system. The approach is built on a variational autoencoder, an unsupervised method for learning a pair of mutually near-inverse functions between a given high-dimensional dataset and a low-dimensional representation. Preliminary experiments establish that the proposed manifold-based approach drives diversity in test data for test adequacy, yields fault-revealing yet realistic test cases for test generation, and provides an independent means to assess the trustability of the target system's output for run-time monitoring.
Writing code is difficult and time consuming. This vision paper proposes Visual Sketching, a synthesis technique that produces code implementing the likely intent associated with an image. We describe potential applications of Visual Sketching, how to realize it, and implications of the technology.
Gamification is the exploitation of game mechanisms for serious applications. In general, these mechanisms target societal challenges by engaging people in game-like scenarios. They are so successful that there is a growing interest in exploiting gamification in other scenarios where the fundamental need is, in general, enhancing people's self-motivation. Notably, greater engagement could be necessary for students attending courses, employees with decreasing enthusiasm for their work, users lacking interest in certain applications, and so forth. In this respect, even though there exist disparate solutions to create gameful applications, they are intended to be built from scratch and stand-alone. This paper instead proposes a line of research in the software engineering field by which gameful mechanisms can be bound to existing software applications to create gamified scenarios. In this way, potential adopters are only required to define the game elements and how they should be combined, while the remaining game automation part is obtained for free. Interestingly, this approach not only simplifies the adoption of gamification elements in pre-existing applications, but it also opens the opportunity of enhancing the engineering of gameful applications as well as the management of the combination of multiple games.
We introduce a new idea for enhancing constraint solving engines that drive many analysis and synthesis techniques that are powerful but have high complexity. Our insight is that in many cases the engines are run repeatedly against input constraints that encode problems that are related but of increasing complexity, and domain-specific knowledge can reduce the complexity. Moreover, even for one formula the engine may perform multiple expensive tasks with commonalities that can be estimated and exploited. We believe these relationships lay a foundation for making the engines more effective and scalable. We illustrate the viability of our idea in the context of a well-known solver for imperative constraints, and discuss how the idea generalizes to more general purpose methods.