This demo presents usage and implementation details of SLEMI. SLEMI is the first tool to automatically find compiler bugs in the widely used cyber-physical system development tool Simulink via Equivalence Modulo Input (EMI). EMI is a recent twist on differential testing that promises more efficiency. SLEMI implements several novel mutation techniques that deal with CPS language features that are not found in procedural languages. This demo also introduces a new EMI-based mutation strategy that has already found a new confirmed bug in Simulink version R2018a. To increase SLEMI's efficiency further, this paper presents parallel generation of random, valid Simulink models. A video demo of SLEMI is available at https://www.youtube.com/watch?v=oliPgOLT6eY.
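The core EMI idea can be illustrated in a few lines of Python (a toy sketch, not SLEMI's actual Simulink mutations): code that is dead for the profiled inputs may be mutated freely, and the resulting equivalent-modulo-input variant must still produce identical outputs under the same compiler.

```python
# Toy sketch of Equivalence Modulo Input (EMI). SLEMI applies the same
# principle to Simulink blocks that are inactive for the profiled signals.

def original(x):
    if x >= 0:
        return x * 2
    return x - 1          # dead for the non-negative profiled inputs below

def emi_mutant(x):
    if x >= 0:
        return x * 2
    return x + 999        # mutated dead code: unreachable for those inputs

# Differential check: any disagreement on a profiled input would indicate
# a bug in the compiler/simulator, not in the program itself.
for i in [0, 1, 5, 42]:
    assert original(i) == emi_mutant(i)
```

Because the mutant only differs in unexecuted code, a compiler that produces different outputs for the two versions on a profiled input has miscompiled at least one of them.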
Service robots, a type of robot that performs useful tasks for humans, are expected to be widely used in the near future in both social and industrial scenarios. These robots will be required to operate in dynamic environments, collaborating with each other or with users. Specifying the list of tasks to be achieved by a robotic team is far from trivial. Therefore, mission specification languages and tools need to be expressive enough to allow the specification of complex missions (e.g., detailing recovery actions), while remaining accessible to domain experts who might not be familiar with programming languages. To support domain experts, we developed PROMISE, a Domain-Specific Language that allows mission specification for multiple robots in a user-friendly, yet rigorous manner. PROMISE is built as an Eclipse plugin that provides a textual and a graphical interface for mission specification. Our tool is in turn integrated into a software framework that provides functionalities such as: (1) automatic generation from the specification; (2) sending of missions to the robotic team; and (3) interpretation and management of missions during execution time. PROMISE and its framework implementation have been validated through simulation and real-world experiments with four different robotic models.
We present a metamorphic testing tool that alleviates the oracle problem in security testing. The tool enables engineers to specify metamorphic relations that capture security properties of Web systems. It automatically tests Web systems to detect vulnerabilities based on those relations. We provide a domain-specific language accompanied by an Eclipse editor to facilitate the specification of metamorphic relations. The tool automatically collects the input data and transforms the metamorphic relations into executable Java code in order to automatically perform security testing based on the collected data. The tool has been successfully evaluated on a commercial system and a leading open source system (Jenkins). Demo video: https://youtu.be/9kx6u9LsGxs.
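One security-oriented metamorphic relation can be sketched as follows (a hypothetical toy example, not the tool's actual DSL or generated Java): requesting a protected path directly and via a trivially transformed but equivalent URL must yield the same authorization outcome; a mismatch indicates a potential access-control bypass.

```python
# Toy metamorphic relation for security testing (hypothetical system under
# test with a deliberate path-normalization flaw).

def is_authorized(user, path):
    protected = {"/admin"}
    return user == "admin" or path not in protected

def metamorphic_check(user, path):
    """Relation: a semantically equivalent URL (extra leading slashes)
    must produce the same authorization decision as the original."""
    follow_up = "//" + path.lstrip("/")
    return is_authorized(user, follow_up) == is_authorized(user, path)

# "guest" is denied "/admin" but allowed "//admin": the relation is violated,
# flagging a vulnerability without needing a full test oracle.
```

The key benefit, as in all metamorphic testing, is that no exact expected output is needed; only the relation between the two executions is checked.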
The use of mobile apps is increasingly widespread, and much effort is put into testing these apps to make sure they behave as intended. In this demo, we present AppTestMigrator, a technique and tool for migrating test cases between apps with similar functionality. The intuition behind AppTestMigrator is that many apps share similarities in their functionality, and these similarities often result in conceptually similar user interfaces (through which that functionality is accessed). AppTestMigrator attempts to automatically transform the sequence of events and oracles in a test case for an app (source app) to events and oracles for another app (target app). The results of our preliminary evaluation show the effectiveness of AppTestMigrator in migrating test cases between mobile apps with similar functionality.
Video URL: https://youtu.be/WQnfEcwYqa4
As blockchain has become increasingly popular across various industries in recent years, many companies have started designing and developing their own smart contract platforms to enable better services on blockchain. While smart contracts are notoriously vulnerable to external attacks, such platform diversity further amplifies the security challenge. To mitigate this problem, we designed the first cross-platform security analyzer for smart contracts, called Seraph. Specifically, Seraph enables automated security analysis for different platforms built on the two mainstream virtual machine architectures, i.e., EVM and WASM. To this end, Seraph introduces a set of general connector APIs to abstract interactions between the virtual machine and the blockchain, e.g., loading and updating storage data on the blockchain. Moreover, we propose a symbolic semantic graph to model critical dependencies, decoupling security analysis from the contract code. Our preliminary evaluation on four existing smart contract platforms demonstrates the potential of Seraph to find security threats both flexibly and accurately. A video of Seraph is available at https://youtu.be/wxixZkVqUsc.
Software repository mining is the foundation for many empirical software engineering studies. The collection and analysis of detailed data can be challenging, especially if data shall be shared to enable replicable research and open science practices. SmartSHARK is an ecosystem that supports replicable and reproducible research based on software repository mining.
Mutation testing can be used to assess the fault-detection capabilities of a given test suite. To this aim, two characteristics of mutation testing frameworks are of paramount importance: (i) they should generate mutants that are representative of real faults; and (ii) they should provide a complete tool chain able to automatically generate, inject, and test the mutants. To address the first point, we recently proposed an approach using a Recurrent Neural Network Encoder-Decoder architecture to learn mutants from ~787k faults mined from real programs. The empirical evaluation of this approach confirmed its ability to generate mutants representative of real faults. In this paper, we address the second point, presenting DeepMutation, a tool wrapping our deep learning model into a fully automated tool chain able to generate, inject, and test mutants learned from real faults.
Ensuring correct trace links between different types of artifacts (requirements, architecture, or code) is crucial for compliance in safety-critical domains, for consistency checking, and for change impact assessment. The point in time when trace links are created, however (i.e., immediately during development or weeks/months later), has a significant impact on their quality. Assessing quality thus relies on obtaining a historical view of artifacts and their trace links at a certain point in the past, which provides valuable insights into when, how, and by whom trace links were created. This work presents TimeTracer, a tool that allows engineers to go back in time - not just to view the history of artifacts but also the history of the trace links associated with these artifacts. TimeTracer integrates easily with different development support tools such as Jira, and it stores artifacts, traces, and changes thereof in a unified artifact model.
Establishing API mappings between libraries is a prerequisite for library migration tasks. Manually establishing API mappings is tedious due to the large number of APIs to be examined, and existing methods based on supervised learning require already-ported or functionally similar applications, which are often unavailable. Therefore, we propose an unsupervised deep-learning-based approach that embeds both API usage semantics and API description (name and document) semantics into a vector space for inferring likely analogical API mappings between libraries. We implement a proof-of-concept website, SimilarAPI (https://similarapi.appspot.com), which can recommend analogical APIs for 583,501 APIs of 111 pairs of analogical Java libraries with diverse functionalities. Video: https://youtu.be/EAwD6l24vLQ
We present FeatureNET, an open-source Neural Architecture Search (NAS) tool that generates diverse sets of Deep Learning (DL) models. FeatureNET relies on a meta-model of deep neural networks consisting of generic configurable entities. It then uses tools developed in the context of software product lines to generate diverse DL models (maximizing the differences between them). The models are translated to Keras and can be integrated into typical machine learning pipelines. FeatureNET allows researchers to seamlessly generate a large variety of models. Thereby, it helps in choosing appropriate DL models and in performing experiments with diverse models (mitigating potential threats to validity). As a NAS method, FeatureNET successfully generates models that perform on par with handcrafted ones.
Recent studies have shown that the performance of deep learning models should be evaluated using various important metrics, such as robustness and neuron coverage, in addition to the widely used prediction accuracy metric. However, major deep learning frameworks currently only provide APIs to evaluate a model's accuracy. To comprehensively assess a deep learning model, framework users and researchers often need to implement new metrics themselves, which is a tedious job. Worse still, due to the large number of hyper-parameters and inadequate documentation, the evaluation results of some deep learning models are hard to reproduce, especially when the models and metrics are both new.
To ease the model evaluation in deep learning systems, we have developed EvalDNN, a user-friendly and extensible toolbox supporting multiple frameworks and metrics with a set of carefully designed APIs. Using EvalDNN, evaluation of a pre-trained model with respect to different metrics can be done with a few lines of code. We have evaluated EvalDNN on 79 models from TensorFlow, Keras, GluonCV, and PyTorch. As a result of our effort made to reproduce the evaluation results of existing work, we release a performance benchmark of popular models, which can be a useful reference to facilitate future research. The tool and benchmark are available at https://github.com/yqtianust/EvalDNN and https://yqtianust.github.io/EvalDNN-benchmark/, respectively. A demo video of EvalDNN is available at: https://youtu.be/v69bNJN2bJc.
Automated testing has been widely used to ensure the quality of Android applications. However, hard-to-comprehend testing results make it difficult for developers to understand and fix potential bugs. This paper proposes FuRong, a novel tool that produces highly readable and actionable bug reports by analyzing automated testing results from multiple devices. FuRong builds a bug model with complete context information, such as screenshots, operation sequences, and logs from multiple devices, and then leverages a pre-trained Decision Tree classifier (with 18 bug category labels) to classify bugs. FuRong deduplicates the classified bugs via Levenshtein distance and finally generates an easy-to-understand report that contains not only the context information of each bug but also possible causes and fix suggestions for each bug category. An empirical study of 8 open-source Android applications, tested automatically on 20 devices, shows the effectiveness of FuRong, which achieves a bug classification precision of 93.4% and a bug classification accuracy of 87.9%. Video URL: https://youtu.be/LUkFTc32B6k
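Deduplication by edit distance can be sketched in a few lines (a minimal illustration of the general idea; FuRong's actual thresholds and report fields are not shown here): reports whose textual distance falls below a threshold are treated as duplicates of an already-kept representative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def deduplicate(reports, threshold=3):
    """Keep one representative per cluster of near-identical bug reports
    (hypothetical threshold chosen for illustration)."""
    kept = []
    for r in reports:
        if all(levenshtein(r, k) > threshold for k in kept):
            kept.append(r)
    return kept
```

For example, two crash signatures differing in a single character collapse into one report, while an unrelated error is kept separately.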
One of the major drawbacks of traditional automatic program repair (APR) techniques is their dependence on a test suite as a repair specification. In practice, it is often hard to obtain specification-quality test suites. This limits the performance, and hence the viability, of such test-suite-based approaches. On the other hand, static-analysis-based bug finding tools are increasingly being adopted in industry, but they still face adoption challenges since the reported violations are viewed as not easily actionable. In previous work, we proposed a novel technique that addresses both challenges by automatically generating high-quality patches for static analysis violations, learning from previous repair examples. In this paper, we present Phoenix, a tool implementing this technique. We describe the architecture, user interfaces, and salient features of Phoenix, as well as specific practical use cases of its technology. A video demonstrating Phoenix is available at https://phoenix-tool.github.io/demo-video.html.
Recent research in empirical software engineering is applying techniques from neurocognitive science and breaking new ground in the ways researchers can model and analyze the cognitive processes of developers as they interact with software artifacts. However, given the novelty of this line of research, only one tool exists to help researchers represent and analyze this kind of multi-modal biometric data. While this tool does help with visualizing temporal eye-tracking and physiological data, it does not allow mapping physiological data to source code elements, instead projecting information over images of code. One drawback is that researchers are still unable to meaningfully combine and map physiological and eye-tracking data to source code artifacts. The use of images also precludes support for long or multiple code files, which prevents researchers from analyzing data from experiments conducted in realistic settings. To address these drawbacks, we propose VITALSE, a tool for the interactive visualization of combined multi-modal biometric data for software engineering tasks. VITALSE provides interactive and customizable temporal heatmaps created from synchronized eye-tracking and biometric data. The tool supports analysis across multiple files, user-defined annotations for points of interest over source code elements, and high-level customizable metric summaries for the provided dataset. VITALSE, a video demonstration, and sample data demonstrating its capabilities can be found at http://www.vitalse.app.
Data-intensive scalable computing (DISC) systems such as Google's MapReduce, Apache Hadoop, and Apache Spark are prevalent in many production services. Despite their popularity, the quality of DISC applications suffers from a lack of exhaustive and automated testing. Current practice is limited to testing DISC applications with a small random sample of the entire input dataset, which rarely exposes program faults. Unlike SQL queries, DISC applications pose new testing challenges due to the composition of dataflow and relational operators with user-defined functions (UDFs) that can be arbitrarily long and complex.
To address this problem, we demonstrate a new white-box testing framework called BigTest that takes an Apache Spark program as input and automatically generates synthetic, concrete data for effective and efficient testing. BigTest combines the symbolic execution of UDFs with the logical specifications of dataflow and relational operators to explore all paths in a DISC application. Our experiments show that BigTest is capable of generating test data that can reveal up to 2X more faults than the entire dataset with 194X less testing time. We implement BigTest as a Java-based command line tool with a pre-compiled binary jar. It exposes a configuration file in which a user can edit preferences, including the path of a target program, the upper bound of loop exploration, and the choice of solver. The demonstration video of BigTest is available at https://youtu.be/OeHhoKiDYso and BigTest is available at https://github.com/maligulzar/BigTest.
Comprehensive test inputs are an essential ingredient for dynamic software analysis techniques, yet are typically impossible to obtain and maintain. Automated input generation techniques can supplant manual effort in many contexts, but they also exhibit inherent limitations in practical applications. Therefore, the best approach to input generation for a given application task necessarily entails compromise. Most symbolic execution approaches maintain soundness by sacrificing completeness. In this paper, we take the opposite approach and demonstrate PG-KLEE, an input generation tool that over-approximates program behavior to achieve complete coverage. We also summarize empirical results that validate our claims. Our technique is detailed in an earlier paper, and the source code of PG-KLEE is publicly available.
Video URL: https://youtu.be/b1ajzW6YWds
Rotten green tests are passing tests which have at least one assertion that is not executed. They give developers a false sense of trust in the code. In this paper, we present RTj, a framework that analyzes test cases from Java projects with the goal of detecting and refactoring rotten test cases. RTj automatically discovered 418 rotten tests from 26 open-source Java projects hosted on GitHub. Using RTj, developers have an automated recommendation of the tests that need to be modified for improving the quality of the applications under test. A video is available at: https://youtu.be/Uqxf-Wzp3Mg
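A rotten green test can be sketched in Python as well (a hypothetical example for illustration; RTj itself analyzes Java projects): the test passes, yet its only assertion sits behind a condition that is never true, so nothing is actually verified.

```python
# A "rotten green" test: green in the test report, but its assertion
# never executes, giving a false sense of trust in the code.

def supports_unicode():
    return False  # hypothetical feature flag, always off in this setup

def test_unicode_roundtrip():
    assertion_ran = False
    if supports_unicode():                        # always False here...
        assert "héllo".encode().decode() == "héllo"
        assertion_ran = True
    return assertion_ran                          # ...so no assertion ran

# The test "passes" while exercising zero assertions.
assert test_unicode_roundtrip() is False
```

Detecting this pattern requires combining static knowledge of where assertions appear with dynamic knowledge of which ones actually executed, which is what rotten-test analysis automates.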
Understanding an unfamiliar program is a daunting task for any programmer, experienced or not. Many studies have shown that even an experienced programmer who is already familiar with the code may still need to rediscover it frequently during software maintenance. The difficulty of program comprehension is even greater when a system is completely new. One well-known remedy for this notorious problem is to create effective technical documentation to make up for the lack of knowledge.
The purpose of technical documentation is to achieve the transfer of knowledge. However, creating effective technical documentation is impeded by many problems in practice. In this paper, we propose a novel tool called GeekyNote to address the major challenges in technical documentation. The key ideas GeekyNote proposes are: (1) documents are transparently anchored to versioned source code; (2) formal textual writings are discouraged and screencasts (or other forms of documents) are encouraged; (3) the up-to-dateness between documents and code can be detected, measured, and managed; (4) documentation that works like a debugging trace is supported; (5) couplings can be easily created and managed for future maintenance needs; (6) how well a system is documented can be measured. A demo video can be accessed at https://youtu.be/cBueuPVDgWM.
With the rapid growth of Android devices, techniques that ensure the high quality of mobile applications (i.e., apps) are receiving more and more attention. It is well accepted that mutation analysis is an effective approach to simulating and locating realistic faults in a program. However, few practical mutation analysis tools exist for Android apps. Even worse, existing mutation analysis tools tend to generate a large number of mutants, many of them stillborn, hindering broader adoption of mutation analysis. Additionally, mutation operators are usually pre-defined by such tools, leaving users little ability to define specific operators to meet their own needs. To address these problems, we propose DroidMutator, a configurable and extensible mutation analysis tool specifically for Android apps. DroidMutator reduces the number of generated stillborn mutants through type checking, and the scope of its mutation operators can be customized so that it only generates mutants in specific code blocks, thus producing fewer mutants with more concentrated purposes. Furthermore, it allows users to easily extend the mutation operators. We have applied DroidMutator to 50 open-source Android apps, and our experimental results show that DroidMutator effectively reduces the number of stillborn mutants and improves the efficiency of mutation analysis.
Demo link: https://github.com/SQS-JLiu/DroidMutator
Video link: https://youtu.be/dtD0oTVioHM
Systematic Literature Reviews (SLRs) have established themselves as a research method in the field of software engineering. The aim of an SLR is to systematically analyze existing literature in order to answer a research question. In this paper, we present a tool to support the SLR process. The main focus of the SLR tool (https://www.slr-tool.com/) is to create and manage an SLR project, to import search results from search engines, and to manage search results by including or excluding each paper. A demo video of our SLR tool is available at https://youtu.be/Jan8JbwiE4k.
We present Nimbus, a framework for writing and deploying Java applications on a Function-as-a-Service ("serverless") platform. Nimbus aims to address four main pain points experienced by developers working on serverless applications: testing can be difficult, deployment can be a slow and painful process, it is challenging to avoid vendor lock-in, and long cold start times can introduce unwelcome latency to function invocations.
Nimbus provides a number of features that aim to overcome these challenges when working with serverless applications. It uses an annotation-based configuration to avoid having to work with large configuration files. It aims to allow the code written to be cloud-agnostic. It provides an environment for local testing where the complete application can be run locally before deployment. Lastly, Nimbus provides mechanisms for optimising the contents and size of the artifacts that are deployed to the cloud, which helps to reduce both deployment times and cold start times.
Software developed and verified using proof assistants, such as Coq, can provide trustworthiness beyond that of software developed using traditional programming languages and testing practices. However, guarantees from formal verification are only as good as the underlying definitions and specification properties. If properties are incomplete, flaws in definitions may not be captured during verification, which can lead to unexpected system behavior and failures. Mutation analysis is a general technique for evaluating specifications for adequacy and completeness, based on making small-scale changes to systems and observing the results. We demonstrate mCoq, the first mutation analysis tool for Coq projects. mCoq changes Coq definitions, with each change producing a modified project version, called a mutant, whose proofs are exhaustively checked. If checking succeeds, i.e., the mutant is live, this may indicate specification incompleteness. Since proof checking can take a long time, we optimized mCoq to perform incremental and parallel processing of mutants. By applying mCoq to popular Coq libraries, we found several instances of incomplete and missing specifications manifested as live mutants. We believe mCoq can be useful to proof engineers and researchers for analyzing software verification projects. The demo video for mCoq can be viewed at: https://youtu.be/QhigpfQ7dNo.
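The mutant-checking loop can be sketched generically (a toy Python stand-in for mCoq's pipeline; the "proof check" here is just a predicate, and the parallelism is a plain thread pool rather than mCoq's actual scheduling): a mutant that passes every check is live and hints at an incomplete specification.

```python
# Generic sketch of parallel mutant checking, illustrating why a live
# mutant signals specification incompleteness.
from concurrent.futures import ThreadPoolExecutor

def find_live_mutants(spec, mutants, workers=4):
    """Check all mutants against the spec in parallel; return the live ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(spec, mutants))
    return [m for m, live in zip(mutants, results) if live]

# Incomplete spec: it only demands non-negativity at one point, so a
# clearly wrong mutant (x * 0) survives as a live mutant.
spec = lambda f: f(3) >= 0
mutants = [lambda x: x + 1, lambda x: x - 100, lambda x: x * 0]
live = find_live_mutants(spec, mutants)   # x - 100 is killed; the others live
```

Strengthening the spec (e.g., also requiring f(3) > 3) would kill the x * 0 mutant, mirroring how live mutants guide proof engineers toward missing specification properties.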
Message passing is the primary programming paradigm in high-performance computing. However, developing message passing programs is challenging due to the non-determinism caused by parallel execution and complex programming features such as non-deterministic communications and asynchrony. We present MPI-SV, a symbolic verifier for parallel C programs that use the Message Passing Interface (MPI). MPI-SV combines symbolic execution and model checking in a synergistic manner to improve scalability and enlarge the scope of verifiable properties. We have applied MPI-SV to real-world MPI C programs. The experimental results indicate that MPI-SV can, on average, achieve 19x speedups in verifying deadlock-freedom and 5x speedups in finding counter-examples. MPI-SV can be accessed at https://mpi-sv.github.io, and the demonstration video is at https://youtu.be/zzCY0CPDNCw.
To ensure interoperability and the correct behavior of heterogeneous distributed systems in key scenarios, it is important to conduct automated integration tests based on distributed test components (called local testers) that are deployed close to the system components to simulate inputs from the environment and monitor interactions with the environment and other system components. We say that a distributed test scenario is locally controllable and locally observable if test inputs can be decided locally and conformance errors can be detected locally by the local testers, without exchanging coordination messages between the test components during test execution (which may reduce the responsiveness and fault detection capability of the test harness). DCO Analyzer is the first tool that checks whether distributed test scenarios specified by means of UML sequence diagrams exhibit those properties, and it automatically determines a minimum number of coordination messages to enforce them.
The demo video for DCO Analyzer can be found at https://youtu.be/LVIusK36_bs.
Deep learning (DL) systems, though widely used, still suffer from quality and reliability issues. Researchers have put much effort into investigating these issues. One promising direction is to leverage uncertainty, an intrinsic characteristic of DL systems when making decisions, to better understand their erroneous behavior. DL system testing is an effective method to reveal potential defects before deployment into safety- and security-critical applications. Various techniques and criteria have been designed to generate defect triggers, i.e., adversarial examples (AEs). However, whether these test inputs achieve a full-spectrum examination of DL systems remains unknown, and the relation between AEs and DL uncertainty is still poorly understood. In this work, we first conduct an empirical study to uncover the characteristics of AEs from the perspective of uncertainty. Then, we propose a novel approach to generate inputs that are missed by existing techniques. Further, we investigate the usefulness and effectiveness of these data for DL robustness enhancement.
In this research, we investigate the effect of pair programming on the minds of software developers using EEG data and how it affects the overall outcome of their task. We use an EEG device to measure the brain-behavior relations of the developer and analyze the electromagnetic waves using ERD and correlation. We measure whether the concentration level is high or low under three different conditions: solo programming, pair programming (navigator), and pair programming (driver). The preliminary results confirm a higher concentration level during pair programming than during solo programming.
Due to the rapid development of deep neural networks, machine translation software has in recent years been widely adopted in people's daily lives, for example to communicate with foreigners or to understand political news from neighbouring countries. However, machine translation software can return incorrect translations because of the complexity of the underlying network. To address this problem, we introduce a novel methodology called PaInv for validating machine translation software. Our key insight is that sentences with different meanings should not have the same translation (i.e., pathological invariance). Specifically, PaInv generates syntactically similar but semantically different sentences by replacing one word in a sentence, and it filters out unsuitable sentences based on both syntactic and semantic information. We have applied PaInv to Google Translate using 200 English sentences as input with three language settings: English→Hindi, English→Chinese, and English→German. PaInv accurately found 331 pathological invariants in total, revealing more than 100 translation errors.
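The pathological-invariance check can be sketched with a stub translator (a toy illustration only; PaInv targets real systems such as Google Translate and uses far richer filtering): two source sentences with different meanings must not collapse to one translation.

```python
# Toy sketch of the pathological-invariance oracle. The translator below is
# a hypothetical lookup table with a deliberate bug: it drops the adjective.

def translate(sentence):
    table = {"a big dog": "un perro",
             "a small dog": "un perro",   # bug: same output as "a big dog"
             "a cat": "un gato"}
    return table[sentence]

def pathological_invariant(s1, s2):
    """True if two distinct-meaning inputs share one translation,
    which flags a likely translation error without a reference oracle."""
    return s1 != s2 and translate(s1) == translate(s2)

print(pathological_invariant("a big dog", "a small dog"))  # True: likely bug
print(pathological_invariant("a big dog", "a cat"))        # False
```

As with other metamorphic approaches, the check needs no ground-truth translation; it only compares the system's outputs against each other.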
Testing cyber-physical system (CPS) development tools such as MathWorks' Simulink is very important, as they are widely used in the design, simulation, and verification of CPS data-flow models. Existing randomized differential testing frameworks such as SLforge leverage semi-formal Simulink specifications to guide random model generation, which requires significant research and engineering investment, along with manual tool updates whenever MathWorks changes its model validity rules. To address these limitations, we propose to learn validity rules automatically with our framework DeepFuzzSL, which trains a language model on an existing corpus of Simulink models. In our experiments, DeepFuzzSL consistently generates over 90% valid Simulink models and has also found 2 bugs confirmed by MathWorks Support.
Modern, agile software development methods rely on iterative work and improvement cycles to deliver their claimed benefits. In Scrum, the most popular agile method, process improvement is implemented through regular Retrospective meetings. In these meetings, team members reflect on the latest development iteration and decide on improvement actions. To identify potential issues, data on the completed iteration needs to be gathered. The Scrum method itself does not prescribe these steps in detail. However, Retrospective games, i.e. interactive group activities, have been proposed to encourage the sharing of experiences and problems. These activities mostly rely on the collected perceptions of team members. However, modern software development practices produce a large variety of digital project artifacts, e.g. commits in version control systems or test run results, which contain detailed information on performed teamwork. We propose taking advantage of this information in new, data-driven Retrospective activities, allowing teams to gain additional insights based on their own team-specific data.
The popularity of Open Source Software (OSS) is at an all-time high, and for it to remain so it is vital that new developers continually join and contribute to the OSS community. In this paper, to better understand first-time contributors, we study the characteristics of the first pull request (PR) made to an OSS project by developers. We mine GitHub for the first OSS PR of 3,501 developers to study characteristics of PRs such as language and size. We find that over a third of the PRs were in Java, while C++ was very unpopular. A large fraction of PRs did not even involve writing code and were a mixture of trivial and non-trivial changes.
This paper introduces type-aware mutation, a simple, but effective methodology for stress testing Satisfiability Modulo Theories (SMT) solvers. The key idea is mutating the operators of the formula to generate test inputs for differential testing, while considering the types of the operators to ensure the mutants are still valid. The realization of type-aware mutation was evaluated on finding bugs in two state-of-the-art SMT solvers, Z3 and CVC4. During the three months of empirical evaluation, 101 unique, confirmed bugs were found by type-aware mutation, and 87 of them have been fixed. The testing efforts and bugs were well-appreciated by the developers.
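The central constraint of type-aware mutation can be sketched abstractly (a toy Python sketch, independent of Z3's or CVC4's actual APIs and of the paper's implementation): an operator may only be replaced by another operator with the same signature, so the mutated formula remains well-typed.

```python
# Toy sketch of type-aware operator mutation for SMT-style formulas.
import random

# Operators grouped by signature (argument sort, result sort); the grouping
# here is a small illustrative subset, not the full SMT-LIB operator set.
SIGNATURES = {
    ("Int", "Int"):  ["+", "-", "*"],               # Int x Int -> Int
    ("Int", "Bool"): ["<", "<=", ">", ">=", "="],   # Int x Int -> Bool
}

def type_aware_mutate(op, rng=random):
    """Replace `op` with a different operator of the same signature,
    guaranteeing the mutant is still a valid (well-typed) formula."""
    for _, ops in SIGNATURES.items():
        if op in ops:
            return rng.choice([o for o in ops if o != op])
    raise ValueError(f"unknown operator {op!r}")

# "(< x y)" may become "(>= x y)", but never "(+ x y)", which would be
# ill-typed; every mutant therefore remains a usable differential test input.
```

Because every mutant is well-typed, solvers cannot trivially reject it, and any disagreement between solvers on a mutant's satisfiability points to a real bug.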
The egocentric bias describes the tendency to value one's own input and perspective higher than that of others. This phenomenon impacts collaboration and teamwork. However, current research on the subject concerning modern software development is lacking. We conducted a case study of 26 final year software engineering students and collected the perceptions of individual contributions to team efforts through regular surveys. We report evidence of an egocentric bias in engineering team members, which decreased over time. In contrast, we found no in-group bias, i.e. favoritism regarding contributions of own team members. We discuss our initial analyses and results, which we hypothesize can be explained by group cohesiveness as well as non-competition and group similarity, respectively.
Developers write logging statements to generate logs and record system execution behaviors, assisting in debugging and software maintenance. However, there are no practical guidelines on where to write logging statements. On the one hand, adding too many logging statements may introduce superfluous, trivial logs and performance overhead. On the other hand, logging too little may miss necessary runtime information. Thus, properly deciding on logging locations is a challenging task, and a finer-grained understanding of where to write logging statements is needed to assist developers in making logging decisions. In this paper, we conduct a comprehensive study to uncover guidelines on logging locations at the code block level. We analyze logging statements and their surrounding code by combining deep learning techniques and manual investigation. Our preliminary results show that our deep learning models achieve over 90% precision and recall when trained using syntactic (e.g., nodes in the abstract syntax tree) and semantic (e.g., variable names) features. However, cross-system models trained using semantic features achieve only 45.6% precision and 73.2% recall, while models trained using syntactic features still achieve over 90% precision and recall. Our current progress highlights that there is an implicit syntactic logging guideline across systems, and such information may be leveraged to uncover general logging guidelines.
Dockerfiles play an important role in the Docker-based software development process, but in practice many Dockerfiles suffer from quality issues. Previous empirical studies have shown an association between code quality and project characteristics. However, the relationship between Dockerfile quality and project characteristics has never been explored. In this paper, we empirically study this relationship through a large dataset of 6,334 projects. Using linear regression analysis, while controlling for various variables, we statistically identify and quantify the relationship between Dockerfile quality and project characteristics.
Open source plays a critical role in our software infrastructure. It is used in the creation of almost every product and makes it increasingly easy to create powerful software cheaply and quickly, which many companies benefit from. However, its importance, and our dependence on it, are often not recognized. Like all software projects, open source needs maintenance to fix bugs and adapt code to evolving technologies. With increasing popularity, demands for maintenance and support work also rise, resulting in many requests and reported issues. How to supply all of the needed maintenance and development work is an open and sometimes controversial question.
Game testing is a necessary but challenging task for gaming platforms. Current game testing practice requires significant manual effort. In this paper, we propose an automated game testing framework combining an adversarial inverse reinforcement learning algorithm with evolutionary multi-objective optimization. The framework aims to help gaming platforms assure market-wide game quality, as it is suitable for testing different games with minimal manual customization for each game.
Software quality and reliability have proved to be important during program development. Many existing studies try to improve them through bug detection and automated program repair. However, each has its own limitations, and overall performance still leaves room for improvement. In this paper, we propose a deep learning framework to improve software quality and reliability in these two detect-fix processes. We use advanced code modeling and AI models to improve on state-of-the-art approaches. The evaluation results show that our approach achieves a relative improvement of up to 206% in F-1 score compared with baselines on bug detection, and a relative improvement of up to 19.8 times in the number of correct bug fixes compared with baselines on automated program repair. These results demonstrate that our framework performs strongly in improving software quality and reliability in the bug detection and automated program repair processes.
Web services often impose constraints that restrict the way in which two or more input parameters can be combined to form valid calls to the service, i.e., inter-parameter dependencies. Current web API specification languages like the OpenAPI Specification (OAS) provide no support for the formal description of such dependencies, making it hardly possible to interact with the services without human intervention. We propose specifying and automatically analyzing inter-parameter dependencies in web APIs. To this end, we propose a domain-specific language to describe these dependencies, a constraint programming-aided tool supporting their automated analysis, and an OAS extension integrating our approach and easing its adoption. Together, these contributions open a new range of possibilities in areas such as source code generation and testing.
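As a rough illustration of what such dependencies look like in practice (the parameter names below are invented, and this is a plain-Python check rather than the authors' DSL), two common kinds are a `Requires` dependency and mutual exclusion:

```python
# Minimal sketch of checking two common kinds of inter-parameter
# dependencies before issuing an API call. Parameter names are hypothetical.
def check_dependencies(params: dict) -> list[str]:
    violations = []
    # Requires: if 'radius' is set, 'location' must be set too.
    if "radius" in params and "location" not in params:
        violations.append("radius requires location")
    # OnlyOne: 'query' and 'ids' are mutually exclusive.
    if "query" in params and "ids" in params:
        violations.append("only one of query/ids may be set")
    return violations

print(check_dependencies({"radius": 10}))                         # Requires violated
print(check_dependencies({"query": "cafe", "ids": [1]}))          # OnlyOne violated
print(check_dependencies({"location": "41.38,2.17", "radius": 10}))  # valid call
```

Encoding such rules declaratively, rather than hand-coding them as above, is precisely what a DSL plus constraint-programming backend enables.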
Cyber-attacks stealing confidential information are becoming increasingly frequent and devastating as modern software systems store and manipulate greater amounts of sensitive data. Leaking information about private user data, such as the financial and medical records of individuals, trade secrets of companies, and military secrets of states, can have drastic consequences. Confidentiality of such private data is critical for users of these systems. Many software development practices, such as the encryption of packets sent over a network, aim to protect the confidentiality of private data by ensuring that an observer is unable to learn anything meaningful about a program's secret input from its public output. Under these protections, the software system's main communication channels, such as the content of the network packets it sends, or the output it writes to a public file, should not leak information about the private data. However, many software systems still contain serious security vulnerabilities. Side channels are an important class of information leaks where secret information can be captured through the observation of non-functional side effects of software systems. Potential side channels include those in execution time, memory usage, size and timing of network packets, and power consumption. Although side-channel vulnerabilities due to hardware (such as vulnerabilities that exploit cache behavior) have been extensively studied [1, 2, 10, 13, 15-17, 19, 23], software side channels have only recently become an active area of research, including recent results on software side-channel detection [4, 8, 11, 12, 18, 22, 24] and quantification [5, 20, 21], and my own work on a static analysis framework for the detection of software side channels, called CoCo-Channel, and a constraint caching framework to accelerate side-channel quantification, called Cashew.
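A textbook example of such a software side channel, independent of the cited tools, is a secret comparison that exits early: its running time reveals how many leading characters of a guess are correct. The sketch below contrasts it with a constant-time check from Python's standard library.

```python
import hmac

# Vulnerable: returns as soon as a mismatching character is found, so the
# running time leaks the length of the matching prefix of the guess.
def insecure_check(secret: str, guess: str) -> bool:
    if len(secret) != len(guess):
        return False
    for s, g in zip(secret, guess):
        if s != g:
            return False
    return True

# Safer: hmac.compare_digest takes time independent of where the inputs differ.
def constant_time_check(secret: str, guess: str) -> bool:
    return hmac.compare_digest(secret.encode(), guess.encode())

print(insecure_check("s3cret", "s3cret"), constant_time_check("s3cret", "guess!"))
# → True False
```

Both functions return the same answers; the vulnerability lies entirely in the non-functional timing behavior, which is why such leaks escape functional testing and motivate dedicated detection and quantification tools.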
Deep Learning (DL) based systems are used widely. Developers update code to fix bugs in these systems, but how such bug-fixes impact the robustness of DL-based systems has not been clear. Does fixing code increase robustness, or does it deteriorate the learning capability of DL-based systems? To answer these questions, we studied 321 Stack Overflow posts based on a published dataset. In this study, we built a classification scheme to analyze how bug-fixes changed the robustness of the DL model and found that most bug-fixes can increase robustness. We also found evidence of bug-fixes that decrease robustness. Our preliminary results suggest that 12.5% and 2.4% of the bug-fixes in Stack Overflow posts caused an increase and a decrease in the robustness of DL models, respectively.
A test smell, analogous to a code smell, is a poor design choice in the implementation of test code. Recently, the concept of test smells has attracted great interest from researchers and practitioners. Surveys show that developers are aware of test smells and their potential consequences for a software system. However, there is limited empirical evidence on how developers address test smells during software evolution. Thus, in this paper, we study 2 research questions: (RQ1) How do test smells evolve? (RQ2) What is the motivation for removing test smells? Our results show that Assertion Roulette, Conditional Test Logic, and Unknown Test have a high churn rate, and that feature addition and improvement motivate refactoring, but test smells persist, indicating sub-optimal practice. With our study, we hope to fill the gap between academia and industry by providing evidence of sub-optimal practice in the way developers address test smells, and how it may be detrimental to the software.
Virtualization and containerization have been two disruptive technologies in the past few years. Both technologies allow isolating applications with fewer resources and have impacted fields such as software testing. In testing, the execution of containerized/virtualized test suites has achieved great savings, but when the complexity increases or the cost of deployment rises, open challenges remain, such as the efficient execution of End to End (E2E) test suites. This paper proposes a research problem and a feasible solution that seeks to improve resource usage in E2E tests through smart resource identification and a proper organization of test execution, in order to achieve efficient and effective resource usage. The resources are characterized by a series of attributes that provide information about each resource and its usage during the E2E testing phase. The test cases are grouped and scheduled with the resources (e.g., parallelized in the same machine or executed in a fixed arrangement), achieving an efficient test suite execution and reducing its total cost/time.
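The grouping step can be sketched as follows; the test names and resource annotations below are invented for illustration and are not from the paper:

```python
from collections import defaultdict

# Hypothetical E2E test cases annotated with the resources they use.
tests = {
    "test_login":    {"browser", "auth-service"},
    "test_checkout": {"browser", "payment-service"},
    "test_payment":  {"payment-service"},
    "test_profile":  {"auth-service"},
}

def group_by_resource(tests):
    """Group tests by shared resource so each deployment can be reused."""
    groups = defaultdict(list)
    for name, resources in tests.items():
        for res in resources:
            groups[res].append(name)
    return dict(groups)

schedule = group_by_resource(tests)
# Tests sharing payment-service can run against one deployment of it.
print(sorted(schedule["payment-service"]))
```

A real scheduler would additionally weigh the resource attributes (deployment cost, startup time, exclusivity) when deciding which groups to parallelize on the same machine; this sketch shows only the grouping idea.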
Blockchain-based decentralized applications (DApps) are becoming more widely accepted because they run publicly on the blockchain and cannot be modified implicitly. However, the fact that only a few developers master both blockchain and front-end programming skills results in error-prone DApps, especially when smart contracts have undergone a migration. Existing techniques rarely pay attention to the automated migration of DApps. In this paper, we first summarize 6 migration categories and propose an approach to figure out where changes are and which categories they belong to. In addition, we design a function call graph structure to ensure the mapping relationship is accurate, and compare it with the differences between two versions of the ABI to offer revising suggestions. We have developed an automated tool implementing our approach on real-world DApps and acquired positive preliminary evaluation results, which illustrate its practical value for realizing automated DApp migration.
Software engineering in the industrial automation domain requires generic methods to keep development complexity at an acceptable level. However, various PLC vendors nowadays use different dialects of the standardized programming languages in their tools, which hinders re-usability and interoperability across platforms. Service-oriented approaches can serve to overcome interoperability issues. In distributed control systems, the functionality of an automation component can be offered to the other parties that constitute a production system via a standardized interface, easing the orchestration of the whole system. This paper proposes such a generic interface that hides away the low-level implementation details of a particular functionality and provides a common semantic model for the execution. Further, we show how using such an interface can help to support and automate the overall engineering process, combining the functionality of different components to fulfill a production task. The reference implementation of the proposed concept was used in an industrial demonstrator, which shows the benefits in system flexibility due to component interoperability and re-usability compared to traditional control approaches.
Program dependence is a fundamental concept in many software engineering tasks, yet traditional dependence analysis struggles to cope with common modern development practices such as multi-lingual implementations and the use of third-party libraries. While Observation-based Slicing (ORBS) solves these issues and produces an accurate slice, it has a scalability problem due to the need to build and execute the target program multiple times. We would like to propose a radical change of perspective: a useful dependence analysis needs to be scalable even if it approximates the dependency. Our goal is a scalable approximate program dependence analysis via estimating the likelihood of dependence. We claim that 1) using external information such as lexical analysis or a development history, 2) learning a dependence model from partial observations, and 3) merging static and observation-based approaches would support this proposition. We expect that our technique would introduce a new perspective on program dependence analysis based on the likelihood of dependence. It would also broaden the capability of dependence analysis towards large and complex software.
Problem: Developers are increasingly adopting security practices in software projects in response to cyber threats. Despite the additional effort required to perform these practices, current cost models either do not consider security as an input or have not been properly validated with empirical data. Hypothesis: Increasing degrees of application of security practices and security features, motivated by security risks, lead to growing levels of added software development effort. Such an effort increase can be quantified through a parametric model that takes as input the usage degrees of security practices and requirements and outputs the additional software development effort. Contributions: The accurate prediction of secure software development effort will support the provision of a proper amount of resources to projects. We also expect that the quantification of the security effort will contribute to advancing research on the cost-effectiveness of software security.
Empirical studies have shown that mobile applications that do not drain the battery usually get good ratings from users. To make mobile applications energy efficient, many studies have been published that present refactoring guidelines and tools to optimize the code. However, these guidelines cannot be generalized w.r.t. energy efficiency, as there is not enough energy-related data for every context. Existing energy enhancement tools/profilers are mostly prototypes applicable to only a small subset of energy-related problems. In addition, the existing guidelines and tools mostly address energy issues once they have already been introduced. My goal is to add to the existing energy-related data by evaluating the energy consumption of various code smell refactorings and third-party libraries used in Android development. Data from such evaluations could provide generalized contextual guidelines to be used during application development to prevent the introduction of energy-related problems. I also aim to develop a support tool for the Android Studio IDE that could give meaningful recommendations to developers during development to make application code more energy efficient.
Adaptive and Learning Agents (ALAs) bring computational intelligence to their cyber-physical host systems to adapt to novel situations encountered in their complex operational environment. They do so by learning from experience to improve their performance. RTCA DO-178C specifies a stringent certification process for airborne software, which presents several challenges when applied to an ALA with regard to functional completeness, functional correctness, testability, and adaptability. This research claims that it is possible to certify an Adaptive Learning Unmanned Aerial Vehicle (UAV) Agent designed as per a Cognitive Architecture under the current DO-178C certification process when leveraging a qualified tool (DO-330), Model-Based Development and Verification (DO-331), and Formal Methods (DO-333). The research consists of developing, as a case study, an ALA embedded in a UAV aimed at neutralizing rogue UAVs in the vicinity of civil airports, and testing it in the field. This article is the plan to complete, by the end of 2022, a dissertation currently in its confirmation phase.
Software application programming interfaces (APIs) are a ubiquitous part of Software Engineering. The evolution of these APIs requires constant effort from their developers and users alike. API developers must constantly balance keeping their products modern whilst keeping them as stable as possible. Meanwhile, API users must continually be on the lookout to adapt to changes that could break their applications. As APIs become more numerous, users are challenged by a myriad of choices and information on which API to use. Current research attempts to provide automatic documentation, code examples, and code completion to make API evolution more scalable for users. Our work will attempt to establish practical and scalable API evolution guidelines and tools based on public code repositories, to aid both API users and API developers.
This thesis focuses on investigating the use of public code repositories provided by the open-source community to improve software API engineering practices. More specifically, I seek to improve software engineering practices linked to API evolution, both from the perspective of API users and API developers. To achieve this goal, I will apply quantitative and qualitative research methods to understand the problems at hand. I will then mine public code repositories to develop novel solutions to these problems.
Refactoring tools automate tedious and error-prone source code changes. The prevalence and difficulty of refactorings in software development make this a high-impact area for successful automation of manual operations. Automated refactoring tools can improve the speed and accuracy of software development and are easily accessible in many programming environments. Even so, developers frequently eschew automation in favor of manual refactoring, citing reasons such as lack of support for real usage scenarios and unpredictable tools. In this paper, we propose to redesign refactoring operations into transformations that are useful and applicable in real software evolution scenarios, with the help of repository mining and user studies.
Technical debt (TD), its impact on development, and its consequences such as defects and vulnerabilities are of common interest and great importance to software researchers and practitioners. Although there exist many studies investigating TD, the majority of them focus on identifying and detecting TD at a single stage of development. There are also studies that analyze vulnerabilities, focusing on some phases of the life cycle. Moreover, several approaches have investigated the relationship between TD and vulnerabilities; however, the generalizability and validity of their findings are limited due to small datasets. In this study, we aim to identify TD through multiple phases of development, and to automatically measure it through data and text mining techniques to form a comprehensive feature model. We plan to utilize neural network based classifiers that will incorporate evolutionary changes in TD measures into predicting vulnerabilities. Our approach will be empirically assessed on open source and industrial projects.
Problem: The goal of a software product line is to aid quick and quality delivery of software products sharing common features. Effectively achieving these goals requires reuse analysis of the product line features. Existing requirements reuse analysis approaches are not focused on recommending product line features that can be reused to realize new customer requirements. Hypothesis: Given that customer requirements are linked to the descriptions of the product line features satisfying them, the customer requirements can be clustered based on patterns and similarities, preserving the historic reuse information. New customer requirements can then be evaluated against existing customer requirements, and reuse of product line features can be recommended. Contributions: We treat the problem of feature reuse analysis as a text classification problem at the requirements level. We use Natural Language Processing and clustering to recommend reuse of features based on similarities and historic reuse information. The recommendations can be used to realize new customer requirements.
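The similarity step of the hypothesis can be sketched with a simple bag-of-words cosine measure; the requirement texts and IDs below are invented, and the actual approach uses richer NLP than whitespace tokenization:

```python
from collections import Counter
import math

# Toy sketch: compare a new customer requirement against historic ones
# using bag-of-words cosine similarity.
def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

historic = {
    "REQ-1": "the vehicle shall support remote door unlocking",
    "REQ-2": "the system shall log all payment transactions",
}
new_req = "support remote unlocking of the vehicle door"
best = max(historic, key=lambda r: cosine(historic[r], new_req))
print(best)  # the historic requirement whose linked feature could be reused
```

In the proposed approach, the feature linked to the best-matching historic requirement (here `REQ-1`) would be recommended for reuse when realizing the new requirement.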
It is well known that it is desirable to capture the most essential parts of software design meetings that take place at the whiteboard. It is equally well known, however, that actual capture rarely takes place. A few photos may be taken, informal notes might be scribbled down, and at best one of the developers may be tasked with creating a summary. Regardless, problems persist with important information being lost and, even when information is captured, that information not being easily located and accessed. To address these problems, I propose to design and evaluate a novel suite of tools that enables software designers working at the whiteboard to: (1) efficiently and in-the-moment capture important information produced during that meeting, and (2) be delivered, either by request or proactively by the tools, relevant information captured in the past when it is needed in a future design meeting.
Developers write logging statements to generate logs and record system execution behaviors. Such logs are widely used for a variety of tasks, such as debugging, testing, program comprehension, and performance analysis. However, there exist no practical guidelines on how to write logging statements, which makes the logging decision a very challenging task. Developers face two main challenges when making logging decisions: 1) it is difficult to accurately and succinctly record execution behaviors; and 2) it is hard to decide where to write logging statements. This thesis proposes a series of approaches to address these problems and help developers make logging decisions in two aspects: deciding on logging contents and on logging locations. Through case studies on large-scale open source and commercial systems, we anticipate that our study will provide useful suggestions and support to developers for writing better logging statements.
Testing of web APIs is nowadays more critical than ever before, as they are the current standard for software integration. A bug in an organization's web API could have a huge impact both internally (services relying on that API) and externally (third-party applications and end users). Most existing tools and testing approaches require writing tests or instrumenting the system under test (SUT). The main aim of this dissertation is to take web API testing to an unprecedented level of automation and thoroughness. To this end, we plan to apply artificial intelligence (AI) techniques for the autonomous detection of software failures. Specifically, the idea is to develop intelligent programs (we call them "bots") capable of generating hundreds, thousands or even millions of test inputs and to evaluate whether the test outputs are correct based on: 1) patterns learned from previous executions of the SUT; and 2) knowledge gained from analyzing thousands of similar programs. Evaluation results of our initial prototype are promising, with bugs being automatically detected in some real-world APIs.
Performance is an important aspect of software quality. Performance goals are typically defined by setting upper and lower bounds for the response time and throughput of a system and for physical-level measurements such as CPU, memory, and I/O. To meet such performance goals, several performance-related activities are needed in development (Dev) and operations (Ops). In fact, large software system failures are often due to performance issues rather than functional bugs. One of the most important performance issues is performance regression. Although not all performance regressions are bugs, they often have a direct impact on users' experience of the system. The process of detecting performance regressions in development and operations faces challenges. First, the detection of performance regressions is conducted after the fact, i.e., after the system is built and deployed in the field or in dedicated performance testing environments. Large amounts of resources are required to detect, locate, understand, and fix performance regressions at such a late stage in the development cycle. Second, even if we can detect a performance regression, it is extremely hard to fix because other changes are applied to the system after the introduction of the regression. These challenges call for further in-depth analyses of performance regressions. In this dissertation, to avoid performance regressions slipping into operations, we first perform an exploratory study on the source code changes that introduce performance regressions, in order to understand their root causes at the source code level. Second, we propose an approach that automatically predicts whether a test would manifest performance regressions in a code commit. To assist practitioners in analyzing system performance with operational data, we propose an approach to recovering a field-representative workload that can be used to detect performance regressions. We also propose using execution logs generated by unit tests to predict performance regressions in load tests.
While there is not much discussion on the importance of formally describing and analyzing quantitative requirements in the process of software construction, in the paradigm of API-based software systems it could be vital. Quantitative attributes can be thought of as attributes determining the Quality of Service (QoS) provided by a software component published as a service. In this sense, they play a determinant role in classifying software artifacts according to specific needs stated as requirements.
In this work, we present a research program consisting of the development of formal languages and tools to characterize and analyze the Quality of Service attributes of software components in the context of distributed systems. More specifically, our main motivational scenario lies in the execution of a service-oriented architecture.
Identifying the source of a program failure plays an integral role in maintaining software quality. Both fault localisation and defect prediction aim to locate faults: fault localisation aims to locate faults after they are revealed, while defect prediction aims to locate yet-to-happen faults. Despite sharing a similar goal, fault localisation and defect prediction have been studied as separate topics, mainly due to the difference in the data available to exploit. In our doctoral research, we aim to bridge fault localisation and defect prediction. Our work is divided into three parts: 1) applying defect prediction to fault localisation, i.e., DP2FL, 2) applying fault localisation to defect prediction, i.e., FL2DP, and 3) consecutive application of DP2FL and FL2DP in a single framework. We expect the synergy between fault localisation and defect prediction not only to improve the accuracy of each process but also to allow us to build a single model that gradually improves the overall software quality throughout the entire software development life-cycle.
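As one concrete bridge between the two fields, spectrum-based fault localisation turns test coverage into suspiciousness scores. The Ochiai formula below is a standard technique in the fault localisation literature, while the coverage matrix is invented for illustration:

```python
import math

# Ochiai suspiciousness: ef/ep = failing/passing tests covering the element,
# nf = failing tests NOT covering it. Higher score = more suspicious.
def ochiai(ef: int, ep: int, nf: int) -> float:
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Invented coverage data: element -> (ef, ep, nf).
coverage = {
    "line 10": (2, 0, 0),  # covered by all failing tests, no passing ones
    "line 11": (1, 3, 1),
    "line 12": (0, 4, 2),  # never covered by a failing test
}
scores = {elem: ochiai(*counts) for elem, counts in coverage.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # → line 10, the most suspicious element
```

Defect prediction works from a different signal (historic change and bug data), which is precisely why combining the two, as DP2FL and FL2DP propose, could enrich what each exploits alone.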
Software testing prevents and detects the introduction of faults and bugs during the process of evolving and delivering reliable software. As an important software development activity, testing has been intensively studied to measure test code quality and effectiveness, and to assist professional developers and testers with automated test generation tools. In recent years, testing has been attracting educators' attention and has been integrated into some Computer Science education programs. Understanding the challenges and problems faced by students can help inform educators of the topics that require extra attention and practice when presenting testing concepts and techniques.
In my research, I study how students implement and modify source code given unit tests, and how they perceive and perform unit testing. I propose to quantitatively measure the quality of student-written test code, and to qualitatively identify the common mistakes and bad smells observed in student-written test code. We compare the performance of students and professionals, who vary in prior testing experience, to investigate the factors that lead to high-quality test code. The ultimate goal of my research is to address the challenges students encounter during test code composition and to improve their testing skills with supportive tools or guidance.
The history of software teaches us that great software projects are the ones that manage to sustain their quality. Free and open source software (FOSS) has become a serious software supply channel. However, trust in FOSS products is still an issue, and quality is a trait that enhances trust. In my study, I investigate the following question: how do FOSS communities sustain their software quality? I argue that human and social factors contribute to the sustainability of quality in FOSS communities. Amongst these factors are: the motivation of participants, a robust governance style for the software change process, and the exercise of good practices in the pull request evaluation process.
In modern software engineering, developers have to work with constantly evolving, interconnected software systems. Understanding how and why these systems and the dependencies between them change is therefore an essential step in improving or maintaining them. For this, it is important to know what changed and how those changes influence the system. Most currently used tools that help developers understand source code changes either use the textual representation of source code, allowing for a coarse-grained overview, or use the AST (abstract syntax tree) representation of source code to extract more fine-grained changes. We plan to improve the accuracy and classification of the extracted source code changes and to extend them by analysing fine-grained changes in source code dependencies. We also propose a dynamic analysis of the impact of the previously extracted changes on performance metrics. This helps to understand which changes caused a certain change in program behaviour. We plan to use and combine this information to generate accurate and detailed change overviews that bridge the gap between existing coarse-grained solutions and the raw changes contained in the code, aiming to reduce the time developers spend reading changed code and to help them quickly understand the changes between two versions of source code.
Despite their growing popularity, apps tend to contain defects which can ultimately manifest as failures (or crashes) to end-users. Different automated tools for testing Android apps have been proposed in order to improve software quality. Although Genetic Algorithms and Evolutionary Algorithms (EAs) have been promising in recent years, recent results suggest they are not yet fully tailored to the problem of Android test generation. Thus, this thesis aims to design and evaluate algorithms for alleviating the burden of testing Android apps. In particular, I plan to investigate which search-based algorithm is best suited for this particular problem. As the thesis advances, I expect to develop a fully open-source test case generator for Android applications that will serve as a framework for comparing different algorithms. These algorithms will be compared using statistical analysis on both open-source (i.e., from F-Droid) and commercial applications (i.e., from the Google Play Store).
Software developers are increasingly having conversations about software development via online chat services. Many of those chat communications contain valuable information, such as code descriptions, good programming practices, and causes of common errors/exceptions. However, the nature of chat community content is transient, as opposed to the archival nature of other developer communications such as email, bug reports and Q&A forums. As a result, important information and advice are lost over time.
The focus of this dissertation is Extracting Archival Information from Software-Related Chats, specifically to (1) automatically identify conversations that contain archival-quality information, (2) accurately reduce the granularity of the information reported as archival information, and (3) conduct a case study to investigate how archival-quality information extracted from chats compares to related posts in Q&A forums. Archiving knowledge from developer chats could potentially be used in several applications, such as: creating a new archival mechanism available to a given chat community, augmenting Q&A forums, or facilitating the mining of specific information and improving software maintenance tools.
Context: Software has become ubiquitous in every corner of modern societies. During the last five decades, software engineering has also changed significantly to advance the development of various types and scales of software products. In this context, software engineering education plays an essential role in keeping students updated with software technologies, processes, and practices that are popular in industry. Aim: In this PhD work, I want to answer the following research questions: To what extent are software engineering trends present in software engineering education? In what ways can the characteristics of growth-phase software startups be transferred into a software engineering education context? What is the impact of software startup engineering on the curriculum and on software engineering students? Method: I utilize literature review and mixed-methods approaches (quantitative and qualitative data and methods triangulation) to gather empirical evidence. More precisely, I split my research method into two phases. The first phase acquires knowledge and insight based on the existing literature. The second phase splits the focus in two directions. Firstly, I shall gather empirical evidence on how software startup practices are present in software engineering education. Secondly, I will conduct parallel investigations into SE practices in growth-phase software startups. Expected Results: I argue that software startup engineering practices are a powerful tool for software engineering education approaches. I expect students to acquire software engineering skills in a more realistic context while using growth-phase software startup practices.
Technical debt (TD) is an economic metaphor used to describe suboptimal choices made in the software development process. It usually occurs when developers take shortcuts instead of following agreed-upon development practices, and unchecked growth of technical debt can harm software development processes.
Technical debt detection and management is mainly done manually, which is both a slow and a costly way of detecting technical debt. Automatic detection would solve this issue, but even today's state-of-the-art tools do not accurately detect the appearance of technical debt. Therefore, increasing the accuracy of automatic classification is of high importance, so that a significant portion of the costs of technical debt detection can be eliminated.
This research aims to solve the detection accuracy problem by bringing together static code analysis and natural language processing. This combination of techniques will allow more accurate detection of technical debt than either technique used alone. The research also aims to discover themes and topics in written developer messages that can be linked to technical debt; these can help us understand technical debt from the developers' viewpoint. Finally, we will build an open-source tool/plugin that can be used to accurately detect technical debt using both static analysis and natural language processing methods.
Data modeling in Cassandra databases follows a query-driven approach where each table is created to satisfy a query, leading to repeated data, as the Cassandra model is not normalized by design. Consequently, developers bear the responsibility of maintaining data integrity at the application level, as opposed to when the model is normalized. This is done by embedding in the client application the appropriate statements to perform data changes, which is error-prone. Cassandra data modeling methodologies have emerged to cope with this problem by proposing the use of a conceptual model to generate the logical model, solving the data modeling problem but not the data integrity one. In this thesis, we address the problem of the integrity of these data by proposing a method that, given a data change at either the conceptual or the logical level, determines the executable statements that should be issued to preserve data integrity. Additionally, as this integrity may also be lost as a consequence of creating new data structures in the logical model, we complement our method to preserve data integrity in these scenarios. Furthermore, we address the creation of data structures at the conceptual level to represent a normalized version of newly created data structures in the logical model.
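To make the integrity burden concrete, here is a minimal, hypothetical sketch (in-memory Python dictionaries standing in for two denormalized Cassandra tables; the table and column names are invented, and the thesis's actual method generates CQL statements rather than function calls). It shows how one logical change fans out into one write per table that duplicates the entity:

```python
# Two denormalized "tables" answering different queries over the same videos
# data -- a hypothetical example, not taken from the thesis.
videos_by_author = {}  # (author, video_id) -> title
videos_by_tag = {}     # (tag, video_id) -> (author, title)

def insert_video(video_id, author, title, tags):
    # The client must write to every table that duplicates the entity,
    # otherwise the copies diverge -- the integrity burden the thesis
    # aims to automate away.
    videos_by_author[(author, video_id)] = title
    for tag in tags:
        videos_by_tag[(tag, video_id)] = (author, title)

def rename_video(video_id, author, tags, new_title):
    # A single logical change fans out into one statement per table.
    videos_by_author[(author, video_id)] = new_title
    for tag in tags:
        videos_by_tag[(tag, video_id)] = (author, new_title)

insert_video("v1", "ada", "Intro", ["cql", "nosql"])
rename_video("v1", "ada", ["cql", "nosql"], "Intro to CQL")
```

Forgetting the loop over `videos_by_tag` in `rename_video` would silently leave stale copies behind, which is exactly the class of error the proposed method prevents by deriving the full statement set from a data change.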
Many developers do not understand how to develop accessible software, or do not recognize the need to. To address this, we have created five educational Accessibility Learning Labs (ALL) using an experiential learning structure. Each lab addresses a foundational concept in computing accessibility, informing participants about creating accessible software while also demonstrating its necessity. The hosted labs provide a complete educational experience, containing materials such as lecture slides, activities, and quizzes.
We evaluated the labs in ten sections of a CS2 course at our university, with 276 students participating. Our primary findings include: I) The labs are an effective way to inform participants about foundational topics in creating accessible software; II) The labs demonstrate the potential benefits of our proposed experiential learning format in motivating participants about the importance of creating accessible software; III) The labs demonstrate that empathy material increases learning retention. The created labs and project materials are publicly available on the project website: http://all.rit.edu
We present Precfix, a pragmatic approach targeting large-scale industrial codebases and making recommendations based on previously observed debugging activities. Precfix collects defect-patch pairs from development histories, performs clustering, and extracts generic reusable patching patterns as recommendations. Our approach is able to make recommendations within milliseconds and achieves a false positive rate of 22%. Precfix has been rolled out at Alibaba to support various critical businesses.
Unfair work distribution is common in project-based learning with teams of students. One contributing factor is that students are differently skilled developers. To mitigate the differences in a course with group work, we introduced mandatory programming lab sessions. The intervention did not affect the work distribution, showing that more is needed to balance the workload. Contrary to our goal, the intervention was very well received among experienced students, but unpopular with students weak at programming.
Data analytics application development introduces many challenges, including: new roles not found in traditional software engineering practices, e.g., data scientists and data engineers; the use of sophisticated machine learning (ML) model-based approaches; uncertainty inherent in the models; interfacing with models to fulfill software functionalities; deploying models at scale; and the rapid evolution of business goals and data sources. We describe our Big Data Analytics Modeling Languages (BiDaML) toolset, which brings all stakeholders around one tool to specify, model, and document big data applications. We report on our experience applying BiDaML to three real-world, large-scale applications. Our approach successfully supports complex data analytics application development in industrial settings.
Due to the growing value of software technology in our everyday life, young professionals and undergraduates need to be well qualified for Software Engineering (SE) careers. Additionally, the didactic basis of SE education is a recent development.
DevOps stands for Development-Operations. It arose in the IT industry as a movement aligning development and operations teams. DevOps is broadly recognized as an IT standard, and there is high demand for DevOps practitioners in industry. Therefore, we studied whether undergraduates acquire adequate DevOps skills to fulfill this demand. We employed Grounded Theory (GT), a social science qualitative research methodology, to study DevOps education from academic and industrial perspectives. In academia, academics were not motivated to learn or adopt DevOps, and we did not find strong evidence of academics teaching DevOps. Academics need incentives to adopt DevOps in order to stimulate interest in teaching it. In industry, DevOps practitioners lack clearly defined roles and responsibilities, because the DevOps field is diverse and growing very fast. Therefore, practitioners can only learn DevOps through hands-on working experience. As a result, academic institutions should provide fundamental DevOps education (in culture, procedure, and technology) to prepare students for their future DevOps advancement in industry. Based on our findings, we propose five groups of future studies to advance DevOps education in academia.
Alerts are a key data source in the monitoring systems of online service systems: they record anomalies in service components and report them to engineers. In general, the occurrence of a service failure tends to be accompanied by a large number of alerts, called an alert storm. Alert storms make failure diagnosis challenging, since it is time-consuming and tedious for engineers to investigate such an overwhelming number of alerts manually. To help understand alert storms, we conduct the first empirical study of them based on large-scale real-world alert data and gain some valuable insights. Based on the findings, we propose a novel approach to handling alert storms. Specifically, the approach comprises alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (above 0.9). Moreover, our alert storm summary reduces the number of alerts that need to be examined by more than 98% and discovers useful alerts accurately.
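The two-stage idea can be sketched with a deliberately simplified stand-in: a fixed-threshold volume rule for detection and template-based deduplication for the summary. The paper's actual detector and summarizer are more sophisticated; the window size, threshold, and alert templates below are illustrative assumptions only.

```python
from collections import Counter

def detect_storms(timestamps, window=60, threshold=50):
    """Flag time windows whose alert volume exceeds a threshold
    (a crude stand-in for the paper's alert storm detection)."""
    buckets = Counter(t // window for t in timestamps)
    return sorted(b * window for b, n in buckets.items() if n >= threshold)

def summarize(alerts):
    """Keep one representative alert per template, in first-seen order
    (a crude stand-in for the paper's alert storm summary)."""
    seen, representatives = set(), []
    for template, text in alerts:
        if template not in seen:
            seen.add(template)
            representatives.append(text)
    return representatives
```

Even this naive summarizer shows how a storm of near-duplicate alerts collapses to a handful of representatives for engineers to inspect.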
A diverse workforce is not just "nice to have"; it is a reflection of a changing world. Such a workforce brings high value to organizations and is essential for national technological innovation, economic vitality, and global competitiveness. Despite the importance of diversity in the broad field of computing, there is a comparatively low representation not only of women but also of other underrepresented minorities, such as indigenous people. To gain insights into their career choices, we conducted 10 interviews with Andean indigenous people. The findings reveal that seven factors (social support, exposure to digital technology, autonomy of use, purpose of use, digital skill, identity, and work ethic) help explain how and why indigenous people choose a career related to Software Engineering. This exploratory study also contributes to challenging common stereotypes and perceptions of indigenous people as low-qualified workers, academically untalented, and unmotivated.
Data scientists frequently analyze data by writing scripts. We conducted a contextual inquiry with interdisciplinary researchers, which revealed that parameter tuning is a highly iterative process and that debugging is time-consuming. As analysis scripts evolve and become more complex, analysts have difficulty conceptualizing their workflow. In particular, after editing a script, it becomes difficult to determine precisely which code blocks depend on the edit. Consequently, scientists frequently re-run entire scripts instead of re-running only the necessary parts. We present ProvBuild, a data analysis environment that uses change impact analysis to improve the iterative debugging process in script-based workflow pipelines. ProvBuild is a tool that leverages language-level provenance to streamline the debugging process by reducing programmer cognitive load and decreasing subsequent runtimes, leading to an overall reduction in elapsed debugging time. ProvBuild uses provenance to track dependencies in a script. When an analyst debugs a script, ProvBuild generates a simplified script that contains only the information necessary to debug a particular problem. We demonstrate that debugging the simplified script lowers a programmer's cognitive load and permits faster re-execution when testing changes. The combination of reduced cognitive load and shorter runtime reduces the time necessary to debug a script. We quantitatively and qualitatively show that even though ProvBuild introduces overhead during a script's first execution, it is a more efficient way for users to debug and tune complex workflows. ProvBuild demonstrates a novel use of language-level provenance, in which it is used to proactively improve programmer productivity rather than merely providing a way to retroactively gain insight into a body of code.
To the best of our knowledge, ProvBuild is a novel application of change impact analysis and it is the first debugging tool to leverage language-level provenance to reduce cognitive load and execution time.
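The core of change impact analysis can be illustrated with a small dependency-graph sketch: given which statements each statement reads from, compute everything transitively downstream of an edit, since only that part must re-run. This is a generic, hypothetical illustration; ProvBuild itself derives these dependencies automatically from language-level provenance rather than from a hand-written map.

```python
def affected(deps, changed):
    """deps maps each statement name to the statements whose results it
    reads. Return all statements transitively downstream of `changed` --
    the only part of the script that needs re-execution."""
    # Invert the map: producer -> set of consumers.
    consumers = {}
    for stmt, reads in deps.items():
        for producer in reads:
            consumers.setdefault(producer, set()).add(stmt)
    # Depth-first traversal of the consumer graph.
    downstream, stack = set(), [changed]
    while stack:
        current = stack.pop()
        for c in consumers.get(current, ()):
            if c not in downstream:
                downstream.add(c)
                stack.append(c)
    return downstream
```

For a pipeline load → clean → fit → plot, with a separate report step reading only the loaded data, editing the cleaning step requires re-running only fit and plot, never load or report.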
The notion of forking has changed with the rise of distributed version control systems and social coding environments like GitHub. Traditionally, forking refers to splitting off an independent development branch (which we call a hard fork); research on hard forks, conducted mostly in pre-GitHub days, showed that they were often seen as critical, since they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call a social fork), which may also have influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among other things, hard forks often evolve out of social forks rather than being planned deliberately, and that perceptions of hard forks have indeed changed dramatically: they are now often seen as a positive, non-competitive alternative to the original project.
With the increasing deployment of enterprise-scale distributed systems, effective and practical defenses for such systems against security vulnerabilities such as sensitive data leaks are urgently needed. However, most existing solutions are limited to centralized programs. For real-world distributed systems, which operate at large scale, current solutions commonly face one or more of scalability, applicability, and portability challenges. To overcome these challenges, we develop a novel dynamic taint analysis for enterprise-scale distributed systems. To achieve scalability, we use a multi-phase analysis strategy to reduce the overall cost. To address the applicability challenge, we infer implicit dependencies via partial ordering of method events in distributed programs. To achieve greater portability, the analysis is designed to work at the application level without customizing platforms. Empirical results show promising scalability and capabilities of our approach.
We propose a methodology to study and visualize the evolution of the modular structure of a network of functional dependencies in a software system. Our method identifies periods of significant refactoring activity, also known as evolutionary hot spots, in software systems. Our approach is based on clustering design structure matrices of functional dependencies and on Kleinberg's method of identifying evolutionary hot spots in dynamic networks. As a case study, we characterize the evolution of the modular structure of Octave over its entire life cycle.
Reentrancy bugs in smart contracts caused a devastating financial loss in 2016 and are considered among the most severe vulnerabilities in smart contracts. Most existing general-purpose security tools for smart contracts claim to be able to detect reentrancy bugs. In this paper, we present Clairvoyance, a cross-function and cross-contract static analysis that detects reentrancy vulnerabilities in smart contracts by identifying infeasible paths. To reduce false positives, we have summarized five major path protective techniques (PPTs) to support fast yet precise path feasibility checking. We have implemented our approach and compared Clairvoyance with three state-of-the-art tools on 17,770 real-world contracts. The results show that Clairvoyance yields the best detection accuracy among all the tools.
The computing education community has shown a long-standing interest in how to analyze the Object-Oriented (OO) source code developed by students to provide them with useful formative tips. In this paper, we propose and evaluate an approach to analyze how students use Java and its language constructs. The approach is implemented through a cloud-based integrated development environment (IDE) and is based on the analysis of the most common violations of the OO paradigm in student source code. Moreover, the IDE supports the automatic generation of reports about students' mistakes and misconceptions that can be used by instructors to improve the course design. The paper discusses the preliminary results of an experiment performed in a class of a Programming II course to investigate the effects of the provided reports in terms of coding ability (concerning the correctness of the produced code).
Many automated test generation tools have been proposed for finding bugs in Android apps. However, a recent study revealed that developers prefer reading automatically generated test cases written in natural language. We present Bugine, a new bug recommendation system that automatically selects relevant bug reports from other applications that have similar bugs. Bugine (1) searches for GitHub issues that mention common UI components shared between the app under test and the apps in our database, and (2) ranks the quality and relevance of these issues. Our results show that Bugine could find 34 new bugs in five evaluated apps.
With the prosperity of Android, developers need to deal with compatibility issues among different devices, which is costly. In this paper, we propose an automated and general approach named ICARUS to identify compatibility-related APIs in Android apps. The insight behind our approach is that compatibility-related APIs have a biased distribution among code segments, similar to the distribution of keywords among documents. This motivates us to leverage statistical features to discriminate compatibility-related APIs from normal APIs. Experimental results on real apps demonstrate the effectiveness of our approach.
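The keywords-among-documents analogy suggests IDF-style features: an API concentrated in few code segments scores high, while one spread evenly scores low. The sketch below is a hypothetical stand-in for this intuition (ICARUS's actual statistical features are not specified in the abstract, and the API names used are invented):

```python
import math

def concentration_scores(segments):
    """segments: list of sets of API names appearing in each code segment.
    Returns an IDF-style score per API: APIs occurring in few segments
    (a biased distribution) score high; ubiquitous APIs score near zero."""
    n = len(segments)
    doc_freq = {}
    for seg in segments:
        for api in seg:
            doc_freq[api] = doc_freq.get(api, 0) + 1
    return {api: math.log(n / d) for api, d in doc_freq.items()}

# Hypothetical example: "Log.d" appears everywhere, while a device-specific
# call appears in only one compatibility-handling segment.
scores = concentration_scores([
    {"Log.d", "getNotchHeight"},
    {"Log.d"},
    {"Log.d", "setText"},
])
```

Under this scoring, `getNotchHeight` (one segment out of three) scores log(3) while the ubiquitous `Log.d` scores zero, matching the biased-distribution insight.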
Even though Lean principles have already been broadly applied to the manufacturing industry, we cannot say the same regarding software development. The objective of this article is therefore to present a real experience where the Lean Kanban method was applied by a software development team from an IT consulting firm. The team (7 people) is responsible for the maintenance of internal management applications at a large governmental organization (over 4,000 employees). It had to combine new evolutionary developments with corrective maintenance and incident resolution within the production area of 20 to 25 information systems with heterogeneous purposes and technologies.
Developers are known to keep third-party dependencies of their projects outdated even if some of them are affected by known vulnerabilities. In this study we aim to understand why they do so. For this, we conducted 25 semi-structured interviews with developers of both large and small-medium enterprises located in nine countries. All interviews were transcribed, coded, and analyzed according to applied thematic analysis. The results of the study reveal important aspects of developers' practices that should be considered by security researchers and dependency tool developers to improve the security of the dependency management process.
Just because software developers say they believe in "X", that does not necessarily mean that "X" is true. As shown here, there exist numerous beliefs listed in the recent Software Engineering literature which are only supported by small portions of the available data. Hence we ask: what is the source of this disconnect between beliefs and evidence?
To answer this question we look for evidence for ten beliefs within 300,000+ changes seen in dozens of open-source projects. Some of those beliefs had strong support across all the projects; specifically, "A commit that involves more added and removed lines is more bug-prone" and "Files with fewer lines contributed by their owners (who contribute most changes) are bug-prone".
Most of the widely-held beliefs studied are only sporadically supported in the data; i.e. large effects can appear in project data and then disappear in subsequent releases. Such sporadic support explains why developers believe things that were relevant to their prior work, but not necessarily their current work.
Jupyter notebooks---documents that contain live code, equations, visualizations, and narrative text---are now among the most popular means to compute, present, discuss, and disseminate scientific findings. In principle, Jupyter notebooks should make it easy to reproduce and extend scientific computations and their findings; in practice, however, this is not the case. The individual code cells in Jupyter notebooks can be executed in any order, with identifier usages preceding their definitions and results preceding their computations. In a sample of 936 published notebooks that would be executable in principle, we found that 73% of them would not be reproducible with straightforward approaches, requiring humans to infer (and often guess) the order in which the authors created the cells.
In this paper, we present an approach to (1) automatically satisfy dependencies between code cells to reconstruct possible execution orders of the cells; and (2) instrument code cells to mitigate the impact of non-reproducible statements (e.g., random functions) in Jupyter notebooks. Our Osiris prototype takes a notebook as input and outputs the possible execution schemes that reproduce the exact notebook results. In our sample, Osiris was able to reconstruct such schemes for 82.23% of all executable notebooks, which is more than three times better than the state of the art; the resulting reordered code is valid program code and thus available for further testing and analysis.
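The dependency-satisfaction idea can be sketched as a greedy schedule over def-use information: extract the names each cell defines and reads, then repeatedly execute any cell whose read names are already available. This is a simplified illustration only; Osiris additionally handles side effects, multiple valid schedules, and non-determinism.

```python
import ast
import builtins

BUILTINS = set(dir(builtins))

def cell_names(code):
    """Names a cell defines (Store context) and reads (other contexts)."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return defined, used

def reconstruct_order(cells):
    """Greedily pick any remaining cell whose non-builtin reads are
    already defined; raise if no executable order exists."""
    info = [cell_names(c) for c in cells]
    available, order = set(), []
    remaining = list(range(len(cells)))
    while remaining:
        ready = next((i for i in remaining
                      if info[i][1] - info[i][0] - BUILTINS <= available),
                     None)
        if ready is None:
            raise ValueError("no executable order found")
        order.append(ready)
        remaining.remove(ready)
        available |= info[ready][0]
    return order
```

Given cells written out of order---`y = x + 1` before `x = 2`---the scheduler recovers an execution order in which every name is defined before use.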
Requirements Engineering (RE) involves critical activities to ensure the accurate elicitation and documentation of clients' requirements. RE is a socio-technical activity and requires intensive communication with several clients. RE activities might be considerably influenced by individuals' cultural backgrounds, because culture has a deep impact on the way people communicate. However, there has been limited exploration of this issue. We present a framework that identifies and analyses cultural influences on RE activities. To build the framework, we employed Hofstede's cultural model and a mixed-methods design comprising two case studies involving two cultures: Saudi Arabia and Australia. The evaluation highlighted that the framework identifies cultural influences with high accuracy in other cultures as well.
Issue triage is a manual and time-consuming process for both open and closed source software projects. Triagers first validate the issue reports and then find the appropriate developers or teams to solve them. In our industrial case, we automated the assignment part of the problem with a machine learning based approach. However, the automated system's average accuracy is 3% below the human triagers' performance. In our effort to improve our approach, we analyzed the incorrectly assigned issue reports and realized that many of them have attachments, mostly screenshots. Such issue reports generally have short descriptions compared to the ones without attachments, which we consider one of the reasons for incorrect classification. In this study, we describe our proposed approach to include this new piece of information for issue triage and present the initial results.
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.
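Open-vocabulary models avoid the proliferation of unseen identifiers by operating on subword units, so that a never-before-seen name decomposes into known pieces. The snippet below illustrates that idea with a deliberately crude rule-based segmentation (underscores and camelCase boundaries); the paper's model instead learns its subword units from data (e.g., with byte-pair encoding), and the identifiers here are invented:

```python
import re

def subword_split(identifier):
    # Crude segmentation on underscores and lower-to-upper case changes;
    # learned subword vocabularies (e.g., BPE) are what NLMs actually use.
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", identifier)
    return [p.lower() for p in parts if p]

# A tiny hypothetical training corpus of identifiers.
corpus = ["getUserName", "get_user_id", "setUserName", "getUserEmail"]
subword_vocab = {s for ident in corpus for s in subword_split(ident)}
```

The payoff: the identifier `setUserEmail` never occurs in the corpus, yet all of its subwords (`set`, `user`, `email`) are in the subword vocabulary, so an open-vocabulary model can still represent it instead of treating it as out-of-vocabulary.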
Based on Grounded Theory guidelines, we interviewed 27 IT professionals to investigate how organizations pursuing continuous delivery should organize their development and operations teams. In this paper, we present the discovered organizational structures: (1) siloed departments, (2) classical DevOps, (3) cross-functional teams, and (4) platform teams.
With the global increase in demand for online tertiary education, teachers are facing unique challenges in scaling assessment activities and meaningful student engagement. One such challenge is the contract cheating behaviour exhibited in the modern online environment --- posing a threat to the academic integrity of tertiary education. These obstacles are amplified in traditionally difficult domains like introductory programming education. Prior research on contract cheating identification proposes that, while challenging, techniques such as developing strong teacher-student relationships and real-time discussions may lead to instances of identifying contract cheating behaviours. The proposition, then, is to scale real-time, student-teacher discussions to large, online cohorts --- similar to the discussions that traditionally took place in the classroom. This poster paper presents Intelligent Discussion Comments (IDCs): a scalable, teacher-asynchronous system which engages students in real-time discussions to extract authentic student understanding. Artificial intelligence services such as voice identification and transcription enrich the discussion process, supporting the teaching team in their decision-making.
Program failures are often caused by invalid inputs, for instance due to input corruption. To obtain the passing input, one needs to debug the data. In this paper we present a generic technique called ddmax that (1) identifies which parts of the input data prevent processing, and (2) recovers as much of the (valuable) input data as possible. To the best of our knowledge, ddmax is the first approach that fixes faults in the input data without requiring program analysis. In our evaluation, ddmax repaired about 69% of input files and recovered about 78% of data within one minute per input.
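To illustrate the *goal* of ddmax --- recovering as much valid input data as possible from a corrupted file --- here is a deliberately naive brute-force stand-in: scan for the longest contiguous fragment that the program accepts. The real ddmax uses an efficient delta-debugging search and can also drop non-contiguous parts; the JSON predicate and inputs below are invented examples.

```python
import json

def longest_valid_fragment(data, passes):
    """Brute-force O(n^2) illustration of input maximization: return the
    longest contiguous substring of `data` accepted by `passes`."""
    best = ""
    for i in range(len(data)):
        # Only try substrings longer than the best found so far.
        for j in range(len(data), i + len(best), -1):
            if passes(data[i:j]):
                best = data[i:j]
                break
    return best

def parses_as_json(s):
    """The 'program' under repair: does the input parse as JSON?"""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False
```

Given a JSON document wrapped in corrupted bytes, the scan recovers the intact object while discarding the garbage --- the same recovery objective ddmax achieves far more efficiently and without requiring program analysis.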
There are often constraints associated with data used in software, describing the expected length, value, uniqueness, and other properties of the stored data. Correctly specifying and checking such constraints is crucial for the reliability, maintainability, and usability of software. This is particularly important for database-backed web applications, where a huge amount of data generated by millions of users plays a central role in user interaction and application logic. Furthermore, such data persists in the database and needs to continue serving users despite frequent software upgrades and data migration. As a result, consistently and comprehensively specifying data constraints, checking them, and handling constraint violations are of utmost importance.
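The kinds of constraints in question --- presence, length, uniqueness --- can be made concrete with a minimal declarative validator. This is a generic, hypothetical sketch (field names and rule keys are invented); real web applications typically split such checks between the database schema and the application's model layer, and inconsistency between the two is precisely what makes the problem hard.

```python
def check_constraints(rows, constraints):
    """rows: list of dicts; constraints: per-field rules such as
    required, max_len, and unique. Returns (row_index, field, error)
    tuples for every violation."""
    errors = []
    seen = {f: set() for f, c in constraints.items() if c.get("unique")}
    for idx, row in enumerate(rows):
        for field, c in constraints.items():
            value = row.get(field)
            if value is None:
                if c.get("required"):
                    errors.append((idx, field, "missing"))
                continue
            if "max_len" in c and len(value) > c["max_len"]:
                errors.append((idx, field, "too long"))
            if c.get("unique"):
                if value in seen[field]:
                    errors.append((idx, field, "duplicate"))
                seen[field].add(value)
    return errors
```

A single pass over the data reports every violation with its location, which is the "checking and handling" half of the problem the paragraph describes.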
We found that many of the reported erroneous cases in popular DNN image classifiers occur because the trained models confuse one class with another or show biases towards some classes over others. Most existing DNN testing techniques focus on per-image violations and so fail to detect class-level confusions or biases. We developed a testing technique to automatically detect class-based confusion and bias errors in DNN-driven image classification software. We evaluated our implementation, DeepInspect, on several popular image classifiers, achieving precision up to 100% (avg. 72.6%) for confusion errors, and up to 84.3% (avg. 66.8%) for bias errors.
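The shift from per-image to class-level checking can be illustrated with a simple confusion-matrix view: flag class pairs where an above-threshold fraction of one class's inputs is predicted as the other. This is a simplified stand-in --- DeepInspect's actual confusion and bias metrics differ, and the labels and threshold below are invented for illustration.

```python
from collections import Counter

def confused_pairs(true_labels, predicted, threshold=0.1):
    """Flag unordered class pairs (a, b) where at least `threshold` of
    class a's inputs are misclassified as b (or vice versa) -- a
    class-level signal that per-image checks would miss."""
    totals = Counter(true_labels)
    errors = Counter((t, p) for t, p in zip(true_labels, predicted)
                     if t != p)
    flagged = set()
    for (t, p), n in errors.items():
        if n / totals[t] >= threshold:
            flagged.add(tuple(sorted((t, p))))
    return flagged
```

In the test below, 3 of 10 "cat" images are predicted as "dog": individually each is just one wrong prediction, but collectively the pair crosses the threshold and is reported as a class-level confusion.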
Context: Programmers frequently look for the code of previously solved problems that they can adapt to their own problem. Despite existing example code on the web, on sites like Stack Overflow, cryptographic Application Programming Interfaces (APIs) are commonly misused. Little is known about what makes examples helpful for developers using crypto APIs. Analogical problem solving is a psychological theory that investigates how people use known solutions to solve new problems. There is evidence that the capacity to reason and solve novel problems, a.k.a. fluid intelligence (Gf), and structurally and procedurally similar solutions support problem solving. Aim: Our goal is to understand whether similarity and Gf also have an effect in the context of using cryptographic APIs with the help of code examples. Method: We conducted a controlled experiment with 76 student participants developing with or without procedurally similar examples, using one of two Java crypto libraries, and measured the participants' Gf as well as the effect on usability (effectiveness, efficiency, satisfaction) and security bugs. Results: We observed a strong effect of code examples with high procedural similarity on all dependent variables. Fluid intelligence (Gf) had no effect. It also made no difference which library the participants used. Conclusions: Example code must be highly similar to a concrete solution, rather than abstract and generic, to have a positive effect in a development task.
Mobile app users post their opinions about apps, report bugs, or request features on various platforms, the main one being app stores. Previous research suggests that Twitter should be used as an additional resource for receiving users' feedback, as app users tweet about different issues. Although classification and review-summarization methods have previously been developed for each platform separately, manual investigation of reviews or tweets is still required to identify the similar or different points discussed on the App Store or Twitter. In this paper, we propose a framework to automatically study the differences or similarities among app reviews from the Google Play Store and tweets by using the semantics of the words. The results of several experiments, compared with expert evaluation, confirm that it can be applied to identify the similarities or differences among the extracted topics, n-grams, and users' comments.
Symbolic execution is a powerful technique for systematically exploring program paths, but scaling symbolic execution to practical programs remains challenging. State-of-the-art techniques struggle to efficiently explore incremental behaviors, especially for highly coupled programs with complex control and data dependencies. In this paper, we present a novel approach for incremental symbolic execution based on an iterative loop between path exploration and path-suffix summarization. On one hand, the explored paths are summarized to enable more precise identification of affected paths; on the other hand, the summary guides path exploration to prune paths that have no incremental behaviors. We implemented a prototype of our approach and conducted experiments on a set of real-world applications. The results show that it is efficient and effective in exploring incremental behaviors.
OSS ecosystems promote code reuse and knowledge sharing across the projects within them. An ecosystem's developers often develop similar activity patterns, which might impact project outcomes in an ecosystem-specific way. Since elite developers play critical roles in most OSS projects, investigating their behaviors at the ecosystem level becomes urgent. Thus, we aim to investigate elite developers' activities and their relationships with project outcomes (productivity and quality). We design a large-scale empirical study which characterizes elite developers' activity profiles and identifies the relationships between their effort allocations and project outcomes across five ecosystems. Our current results and findings reveal that elite developers in each ecosystem do behave in ecosystem-specific ways. Further, we find that the elites' effort allocations to different activity categories are potentially correlated with project outcomes.
The need for mobile applications and mobile programming is increasing due to the continuous rise in the pervasiveness of mobile devices. Developers often refer to video programming tutorials to learn more about mobile programming topics. To find the right video to watch, developers typically skim over several videos, looking at their title, description, and video content in order to determine if they are relevant to their information needs. Unfortunately, the title and description do not always provide an accurate overview, and skimming over videos is time-consuming and can lead to missing important information. We propose a novel approach that locates and extracts the GUI screens showcased in a video tutorial, then selects and displays the most representative ones to provide a GUI-focused overview of the video. We believe this overview can be used by developers as an additional source of information for determining if a video contains the information they need. To evaluate our approach, we performed an empirical study on iOS and Android programming screencasts which investigates the accuracy of our automated GUI extraction. The results reveal that our approach can detect and extract GUI screens with an accuracy of 94%.
We present DLFix, a two-layer tree-based model that learns bug-fixing code changes and their surrounding code context to improve Automated Program Repair (APR). The first layer learns the surrounding code context of a fix and uses it as weights for the second layer, which learns the bug-fixing code transformation. Our empirical results on Defects4J show that DLFix can fix 30 bugs, and its results are comparable and complementary to those of the best-performing pattern-based APR tools. Furthermore, DLFix fixes 2.5 times more bugs than the best-performing deep learning baseline.
Developer forums are among the most popular and useful Q&A venues for questions about API usage. Analyzing API forums can be a critical step toward automated question-answering approaches. In this poster, we empirically study three API forums (Twitter, eBay, and AdWords) to investigate the characteristics of their question-answering processes. We observe that over 60% of the posts on all forums were answered with API method names or documentation, that over 85% of the questions were answered by API development teams, and that answers from API development teams drew fewer follow-up questions. Our results provide empirical evidence for future work on automated solutions that answer developer questions on API forums.
Divergent forks are a common practice in open-source software development for performing long-term, independent, and diverging development on top of a popular source repository. However, keeping such divergent downstream forks in sync with the upstream source evolution poses engineering challenges in the form of frequent merge conflicts. In this work, we conduct the first industrial case study of frequent merges from upstream and the resulting merge conflicts, in the context of Microsoft Edge development. The study consists of two parts. First, we describe the nature of merge conflicts that arise due to merges from upstream. Second, we investigate the feasibility of automatically fixing a class of merge conflicts related to build breaks, which consume a significant amount of developer time to root-cause and fix. Towards this end, we implemented a tool, MrgBldBrkFixer, evaluated it on three months of real Microsoft Edge Beta development data, and report encouraging results.
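Before any tool can attempt an automatic resolution, it must first recognize the conflict regions a merge produces. The sketch below is purely illustrative (it is not MrgBldBrkFixer): it parses the standard Git conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) that delimit the two sides of each conflict hunk.

```python
# Illustrative sketch: extracting (ours, theirs) hunks from a file
# containing standard Git conflict markers. Not MrgBldBrkFixer itself;
# just the parsing step any conflict-handling tool needs.

def extract_conflicts(text):
    """Return a list of (ours, theirs) string pairs, one per conflict."""
    conflicts = []
    ours, theirs, state = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            ours, theirs, state = [], [], "ours"
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            conflicts.append(("\n".join(ours), "\n".join(theirs)))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts

sample = """int x = 1;
<<<<<<< HEAD
int y = 2;
=======
int y = 3;
>>>>>>> upstream/main
"""
print(extract_conflicts(sample))  # [('int y = 2;', 'int y = 3;')]
```

In a downstream fork, the "theirs" side of each hunk typically comes from the upstream branch being merged in; a repair tool would then decide, per hunk, which side (or combination) keeps the build green.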
Deep Learning (DL) systems are key enablers for engineering intelligent applications. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. We introduce DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC establishes a layer-wise functional understanding of the importance of DL system components and uses this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems and across multiple DL datasets demonstrates the usefulness and effectiveness of DeepImportance.
The rise in awareness of sustainable software has led to a focus on energy efficiency and consideration of code smells during software development. This in turn requires software engineering teachers to cover topics such as code smells in their courses, raising students' awareness of the impact of code smells and bad design choices, not just on the software but also on the environment. Thus, we propose a desktop game named Refactor4Green to teach code smells and refactoring to novice programmers. The core idea of the game is to introduce code smells with refactoring choices through the theme of a green environment. We conducted a preliminary study with university students and received positive feedback from 83.06% of the participants.
To give students as authentic a learning experience as possible, many software-focused degrees incorporate team-based capstone projects in the final year of study. Designing capstone projects, however, is not a trivial undertaking: a number of constraints need to be considered, especially when defining learning outcomes, choosing clients and projects, guiding students, creating an effective project "support infrastructure", and measuring student outcomes. To address these challenges, we propose ACE, a novel, scalable model for managing capstone projects that adapts Spotify's Squads and Tribes organization to an educational setting. We present our motivation, the key components of the model, and its adoption, and report preliminary observations.
Ethical and social problems of the emerging technology of self-driving cars are best addressed through an applied engineering-ethics approach. Currently, however, these problems are typically framed as an idealized, unsolvable decision-making problem, the so-called Trolley Problem. Instead, we propose that ethical analysis should focus on the ethics of complex real-world engineering problems. Because software plays a crucial role in the control of self-driving cars, software engineering solutions should handle actual ethical and social considerations. We take a closer look at the regulative instruments, standards, and the design and implementation of components, systems, and services, and we present practical social and ethical challenges that must be met in the socio-technological ecosystem of self-driving cars. These challenges imply new expectations for software engineering in the automotive industry.