This demo presents usage and implementation details of SLEMI. SLEMI is the first tool to automatically find compiler bugs in the widely used cyber-physical system development tool Simulink via Equivalence Modulo Input (EMI). EMI is a recent twist on differential testing that promises more efficiency. SLEMI implements several novel mutation techniques that deal with CPS language features that are not found in procedural languages. This demo also introduces a new EMI-based mutation strategy that has already found a new confirmed bug in Simulink version R2018a. To increase SLEMI's efficiency further, this paper presents parallel generation of random, valid Simulink models. A video demo of SLEMI is available at https://www.youtube.com/watch?v=oliPgOLT6eY.
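The core EMI idea can be illustrated in a few lines of Python (a toy sketch, not SLEMI's actual Simulink mutations): code that is dead for the profiled inputs may be mutated freely, and the resulting equivalent-modulo-input variant must still produce identical outputs under the same compiler.

```python
# Toy sketch of Equivalence Modulo Input (EMI). SLEMI applies the same
# principle to Simulink blocks that are inactive for the profiled signals.

def original(x):
    if x >= 0:
        return x * 2
    return x - 1          # dead for the non-negative profiled inputs below

def emi_mutant(x):
    if x >= 0:
        return x * 2
    return x + 999        # mutated dead code: unreachable for those inputs

# Differential check: any disagreement on a profiled input would indicate
# a bug in the compiler/simulator, not in the program itself.
for i in [0, 1, 5, 42]:
    assert original(i) == emi_mutant(i)
```

Because the mutant only differs in unexecuted code, a compiler that produces different outputs for the two versions on a profiled input has miscompiled at least one of them.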
Service robots, a type of robot that performs useful tasks for humans, are expected to be widely used in the near future in both social and industrial scenarios. These robots will be required to operate in dynamic environments, collaborating with each other or with users. Specifying the list of tasks to be achieved by a robotic team is far from trivial. Therefore, mission specification languages and tools need to be expressive enough to allow the specification of complex missions (e.g., detailing recovery actions), while remaining accessible to domain experts who might not be familiar with programming languages. To support domain experts, we developed PROMISE, a Domain-Specific Language that allows mission specification for multiple robots in a user-friendly, yet rigorous manner. PROMISE is built as an Eclipse plugin that provides a textual and a graphical interface for mission specification. Our tool is in turn integrated into a software framework that provides functionalities such as: (1) automatic generation from the specification; (2) sending of missions to the robotic team; and (3) interpretation and management of missions during execution time. PROMISE and its framework implementation have been validated through simulation and real-world experiments with four different robotic models.
We present a metamorphic testing tool that alleviates the oracle problem in security testing. The tool enables engineers to specify metamorphic relations that capture security properties of Web systems. It automatically tests Web systems to detect vulnerabilities based on those relations. We provide a domain-specific language accompanied by an Eclipse editor to facilitate the specification of metamorphic relations. The tool automatically collects the input data and transforms the metamorphic relations into executable Java code in order to automatically perform security testing based on the collected data. The tool has been successfully evaluated on a commercial system and a leading open source system (Jenkins). Demo video: https://youtu.be/9kx6u9LsGxs.
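One security-oriented metamorphic relation can be sketched as follows (a hypothetical toy example, not the tool's actual DSL or generated Java): requesting a protected path directly and via a trivially transformed but equivalent URL must yield the same authorization outcome; a mismatch indicates a potential access-control bypass.

```python
# Toy metamorphic relation for security testing (hypothetical system under
# test with a deliberate path-normalization flaw).

def is_authorized(user, path):
    protected = {"/admin"}
    return user == "admin" or path not in protected

def metamorphic_check(user, path):
    """Relation: a semantically equivalent URL (extra leading slashes)
    must produce the same authorization decision as the original."""
    follow_up = "//" + path.lstrip("/")
    return is_authorized(user, follow_up) == is_authorized(user, path)

# "guest" is denied "/admin" but allowed "//admin": the relation is violated,
# flagging a vulnerability without needing a full test oracle.
```

The key benefit, as in all metamorphic testing, is that no exact expected output is needed; only the relation between the two executions is checked.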
The use of mobile apps is increasingly widespread, and much effort is put into testing these apps to make sure they behave as intended. In this demo, we present AppTestMigrator, a technique and tool for migrating test cases between apps with similar functionality. The intuition behind AppTestMigrator is that many apps share similarities in their functionality, and these similarities often result in conceptually similar user interfaces (through which that functionality is accessed). AppTestMigrator attempts to automatically transform the sequence of events and oracles in a test case for an app (source app) to events and oracles for another app (target app). The results of our preliminary evaluation show the effectiveness of AppTestMigrator in migrating test cases between mobile apps with similar functionality.
Video URL: https://youtu.be/WQnfEcwYqa4
As blockchain has become increasingly popular across various industries in recent years, many companies have started designing and developing their own smart contract platforms to enable better services on blockchain. While smart contracts are notoriously vulnerable to external attacks, such platform diversity further amplifies the security challenge. To mitigate this problem, we designed the first cross-platform security analyzer for smart contracts, called Seraph. Specifically, Seraph enables automated security analysis for different platforms built on the two mainstream virtual machine architectures, i.e., EVM and WASM. To this end, Seraph introduces a set of general connector APIs to abstract interactions between the virtual machine and the blockchain, e.g., loading and updating storage data on the blockchain. Moreover, we propose a symbolic semantic graph to model critical dependencies, decoupling security analysis from the contract code. Our preliminary evaluation on four existing smart contract platforms demonstrates the potential of Seraph to find security threats both flexibly and accurately. A video of Seraph is available at https://youtu.be/wxixZkVqUsc.
Software repository mining is the foundation for many empirical software engineering studies. The collection and analysis of detailed data can be challenging, especially if data shall be shared to enable replicable research and open science practices. SmartSHARK is an ecosystem that supports replicable and reproducible research based on software repository mining.
Mutation testing can be used to assess the fault-detection capabilities of a given test suite. To this aim, two characteristics of mutation testing frameworks are of paramount importance: (i) they should generate mutants that are representative of real faults; and (ii) they should provide a complete tool chain able to automatically generate, inject, and test the mutants. To address the first point, we recently proposed an approach using a Recurrent Neural Network Encoder-Decoder architecture to learn mutants from ~787k faults mined from real programs. The empirical evaluation of this approach confirmed its ability to generate mutants representative of real faults. In this paper, we address the second point, presenting DeepMutation, a tool wrapping our deep learning model into a fully automated tool chain able to generate, inject, and test mutants learned from real faults.
Ensuring correct trace links between different types of artifacts (requirements, architecture, or code) is crucial for compliance in safety-critical domains, for consistency checking, and for change impact assessment. The point in time when trace links are created, however (i.e., immediately during development or weeks/months later), has a significant impact on their quality. Assessing quality thus relies on obtaining a historical view of artifacts and their trace links at a certain point in the past, which provides valuable insights into when, how, and by whom trace links were created. This work presents TimeTracer, a tool that allows engineers to go back in time - not just to view the history of artifacts but also the history of the trace links associated with these artifacts. TimeTracer integrates easily with different development support tools such as Jira, and it stores artifacts, traces, and changes thereof in a unified artifact model.
Establishing API mappings between libraries is a prerequisite for library migration tasks. Manually establishing API mappings is tedious due to the large number of APIs to be examined, and existing methods based on supervised learning require already-ported or functionally similar applications, which are often unavailable. Therefore, we propose an unsupervised deep-learning-based approach that embeds both API usage semantics and API description (name and document) semantics into a vector space for inferring likely analogical API mappings between libraries. We implement a proof-of-concept website, SimilarAPI (https://similarapi.appspot.com), which can recommend analogical APIs for 583,501 APIs of 111 pairs of analogical Java libraries with diverse functionalities. Video: https://youtu.be/EAwD6l24vLQ
We present FeatureNET, an open-source Neural Architecture Search (NAS) tool that generates diverse sets of Deep Learning (DL) models. FeatureNET relies on a meta-model of deep neural networks consisting of generic configurable entities. It then uses tools developed in the context of software product lines to generate diverse DL models (maximizing the differences between them). The models are translated to Keras and can be integrated into typical machine learning pipelines. FeatureNET allows researchers to seamlessly generate a large variety of models. Thereby, it helps in choosing appropriate DL models and in performing experiments with diverse models (mitigating potential threats to validity). As a NAS method, FeatureNET successfully generates models that perform on par with handcrafted ones.
Recent studies have shown that the performance of deep learning models should be evaluated using various important metrics, such as robustness and neuron coverage, in addition to the widely used prediction accuracy metric. However, major deep learning frameworks currently only provide APIs to evaluate a model's accuracy. To comprehensively assess a deep learning model, framework users and researchers often need to implement new metrics themselves, which is a tedious job. Worse still, due to the large number of hyper-parameters and inadequate documentation, the evaluation results of some deep learning models are hard to reproduce, especially when the models and metrics are both new.
To ease the model evaluation in deep learning systems, we have developed EvalDNN, a user-friendly and extensible toolbox supporting multiple frameworks and metrics with a set of carefully designed APIs. Using EvalDNN, evaluation of a pre-trained model with respect to different metrics can be done with a few lines of code. We have evaluated EvalDNN on 79 models from TensorFlow, Keras, GluonCV, and PyTorch. As a result of our effort made to reproduce the evaluation results of existing work, we release a performance benchmark of popular models, which can be a useful reference to facilitate future research. The tool and benchmark are available at https://github.com/yqtianust/EvalDNN and https://yqtianust.github.io/EvalDNN-benchmark/, respectively. A demo video of EvalDNN is available at: https://youtu.be/v69bNJN2bJc.
Automated testing has been widely used to ensure the quality of Android applications. However, hard-to-comprehend testing results make it difficult for developers to understand and fix potential bugs. This paper proposes FuRong, a novel tool that produces highly readable and actionable bug reports by analyzing automated testing results from multiple devices. FuRong builds a bug model with complete context information, such as screenshots, operation sequences, and logs from multiple devices, and then leverages a pre-trained Decision Tree classifier (with 18 bug category labels) to classify bugs. FuRong deduplicates the classified bugs via Levenshtein distance and finally generates an easy-to-understand report that contains not only the context information of each bug but also possible causes and fix suggestions for each bug category. An empirical study of 8 open-source Android applications, tested automatically on 20 devices, shows the effectiveness of FuRong, which achieves a bug classification precision of 93.4% and a bug classification accuracy of 87.9%. Video URL: https://youtu.be/LUkFTc32B6k
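Deduplication by edit distance can be sketched in a few lines (a minimal illustration of the general idea; FuRong's actual thresholds and report fields are not shown here): reports whose textual distance falls below a threshold are treated as duplicates of an already-kept representative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def deduplicate(reports, threshold=3):
    """Keep one representative per cluster of near-identical bug reports
    (hypothetical threshold chosen for illustration)."""
    kept = []
    for r in reports:
        if all(levenshtein(r, k) > threshold for k in kept):
            kept.append(r)
    return kept
```

For example, two crash signatures differing in a single character collapse into one report, while an unrelated error is kept separately.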
One of the major drawbacks of traditional automatic program repair (APR) techniques is their dependence on a test suite as a repair specification. In practice, it is often hard to obtain specification-quality test suites. This limits the performance, and hence the viability, of such test-suite-based approaches. On the other hand, static-analysis-based bug finding tools are increasingly being adopted in industry, but they still face adoption challenges since the reported violations are viewed as not easily actionable. In previous work, we proposed a novel technique that addresses both challenges by automatically generating high-quality patches for static analysis violations, learning from previous repair examples. In this paper, we present Phoenix, a tool implementing this technique. We describe the architecture, user interfaces, and salient features of Phoenix, as well as specific practical use cases of its technology. A video demonstrating Phoenix is available at https://phoenix-tool.github.io/demo-video.html.
Recent research in empirical software engineering is applying techniques from neurocognitive science and breaking new ground in the ways researchers can model and analyze the cognitive processes of developers as they interact with software artifacts. However, given the novelty of this line of research, only one tool exists to help researchers represent and analyze this kind of multi-modal biometric data. While this tool does help with visualizing temporal eye-tracking and physiological data, it does not allow mapping physiological data to source code elements, instead projecting information over images of code. One drawback is that researchers are still unable to meaningfully combine and map physiological and eye-tracking data to source code artifacts. The use of images also precludes support for long or multiple code files, which prevents researchers from analyzing data from experiments conducted in realistic settings. To address these drawbacks, we propose VITALSE, a tool for the interactive visualization of combined multi-modal biometric data for software engineering tasks. VITALSE provides interactive and customizable temporal heatmaps created from synchronized eye-tracking and biometric data. The tool supports analysis across multiple files, user-defined annotations for points of interest over source code elements, and high-level customizable metric summaries for the provided dataset. VITALSE, a video demonstration, and sample data demonstrating its capabilities can be found at http://www.vitalse.app.
Data-intensive scalable computing (DISC) systems such as Google's MapReduce, Apache Hadoop, and Apache Spark are prevalent in many production services. Despite their popularity, the quality of DISC applications suffers from a lack of exhaustive and automated testing. Current practice is limited to testing DISC applications with a small random sample of the entire input dataset, which rarely exposes program faults. Unlike SQL queries, DISC applications pose new testing challenges due to the composition of dataflow and relational operators with user-defined functions (UDFs) that can be arbitrarily long and complex.
To address this problem, we demonstrate a new white-box testing framework called BigTest that takes an Apache Spark program as input and automatically generates synthetic, concrete data for effective and efficient testing. BigTest combines the symbolic execution of UDFs with the logical specifications of dataflow and relational operators to explore all paths in a DISC application. Our experiments show that BigTest is capable of generating test data that can reveal up to 2X more faults than the entire dataset with 194X less testing time. We implement BigTest as a Java-based command line tool with a pre-compiled binary jar. It exposes a configuration file in which a user can edit preferences, including the path of a target program, the upper bound of loop exploration, and the choice of solver. The demonstration video of BigTest is available at https://youtu.be/OeHhoKiDYso and BigTest is available at https://github.com/maligulzar/BigTest.
Comprehensive test inputs are an essential ingredient for dynamic software analysis techniques, yet are typically impossible to obtain and maintain. Automated input generation techniques can supplant manual effort in many contexts, but they also exhibit inherent limitations in practical applications. Therefore, the best approach to input generation for a given application task necessarily entails compromise. Most symbolic execution approaches maintain soundness by sacrificing completeness. In this paper, we take the opposite approach and demonstrate PG-KLEE, an input generation tool that over-approximates program behavior to achieve complete coverage. We also summarize empirical results that validate our claims. Our technique is detailed in an earlier paper, and the source code of PG-KLEE is publicly available.
Video URL: https://youtu.be/b1ajzW6YWds
Rotten green tests are passing tests which have at least one assertion that is not executed. They give developers a false sense of trust in the code. In this paper, we present RTj, a framework that analyzes test cases from Java projects with the goal of detecting and refactoring rotten test cases. RTj automatically discovered 418 rotten tests from 26 open-source Java projects hosted on GitHub. Using RTj, developers have an automated recommendation of the tests that need to be modified for improving the quality of the applications under test. A video is available at: https://youtu.be/Uqxf-Wzp3Mg
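A rotten green test can be sketched in Python as well (a hypothetical example for illustration; RTj itself analyzes Java projects): the test passes, yet its only assertion sits behind a condition that is never true, so nothing is actually verified.

```python
# A "rotten green" test: green in the test report, but its assertion
# never executes, giving a false sense of trust in the code.

def supports_unicode():
    return False  # hypothetical feature flag, always off in this setup

def test_unicode_roundtrip():
    assertion_ran = False
    if supports_unicode():                        # always False here...
        assert "héllo".encode().decode() == "héllo"
        assertion_ran = True
    return assertion_ran                          # ...so no assertion ran

# The test "passes" while exercising zero assertions.
assert test_unicode_roundtrip() is False
```

Detecting this pattern requires combining static knowledge of where assertions appear with dynamic knowledge of which ones actually executed, which is what rotten-test analysis automates.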
Understanding an unfamiliar program is a daunting task for any programmer, experienced or not. Many studies have shown that even an experienced programmer who is already familiar with the code may still need to rediscover it frequently during software maintenance. The difficulty of program comprehension is even greater when a system is completely new. One well-known remedy for this notorious problem is to create effective technical documentation to make up for the lack of knowledge.
The purpose of technical documentation is to achieve the transfer of knowledge. However, creating effective technical documentation is impeded by many problems in practice. In this paper, we propose a novel tool called GeekyNote to address the major challenges in technical documentation. The key ideas GeekyNote proposes are: (1) documents are transparently anchored to versioned source code; (2) formal textual writings are discouraged and screencasts (or other forms of documents) are encouraged; (3) the up-to-dateness between documents and code can be detected, measured, and managed; (4) documentation that works like a debugging trace is supported; (5) couplings can be easily created and managed for future maintenance needs; (6) how well a system is documented can be measured. A demo video can be accessed at https://youtu.be/cBueuPVDgWM.
With the rapid growth of Android devices, techniques that ensure the high quality of mobile applications (i.e., apps) are receiving more and more attention. It is well accepted that mutation analysis is an effective approach to simulating and locating realistic faults in a program. However, few practical mutation analysis tools exist for Android apps. Even worse, existing mutation analysis tools tend to generate a large number of mutants, many of them stillborn, hindering broader adoption of mutation analysis. Additionally, mutation operators are usually pre-defined by such tools, leaving users little ability to define specific operators to meet their own needs. To address these problems, we propose DroidMutator, a configurable and extensible mutation analysis tool specifically for Android apps. DroidMutator reduces the number of generated stillborn mutants through type checking, and the scope of its mutation operators can be customized so that it only generates mutants in specific code blocks, thus producing fewer mutants with more concentrated purposes. Furthermore, it allows users to easily extend the mutation operators. We have applied DroidMutator to 50 open-source Android apps, and our experimental results show that DroidMutator effectively reduces the number of stillborn mutants and improves the efficiency of mutation analysis.
Demo link: https://github.com/SQS-JLiu/DroidMutator
Video link: https://youtu.be/dtD0oTVioHM
Systematic Literature Reviews (SLRs) have established themselves as a research method in the field of software engineering. The aim of an SLR is to systematically analyze existing literature in order to answer a research question. In this paper, we present a tool to support the SLR process. The main focus of the SLR tool (https://www.slr-tool.com/) is to create and manage an SLR project, to import search results from search engines, and to manage search results by including or excluding each paper. A demo video of our SLR tool is available at https://youtu.be/Jan8JbwiE4k.
We present Nimbus, a framework for writing and deploying Java applications on a Function-as-a-Service ("serverless") platform. Nimbus aims to address four main pain points experienced by developers working on serverless applications: testing can be difficult, deployment can be a slow and painful process, it is challenging to avoid vendor lock-in, and long cold start times can introduce unwelcome latency to function invocations.
Nimbus provides a number of features that aim to overcome these challenges when working with serverless applications. It uses an annotation-based configuration to avoid having to work with large configuration files. It aims to allow the code written to be cloud-agnostic. It provides an environment for local testing where the complete application can be run locally before deployment. Lastly, Nimbus provides mechanisms for optimising the contents and size of the artifacts that are deployed to the cloud, which helps to reduce both deployment times and cold start times.
Software developed and verified using proof assistants, such as Coq, can provide trustworthiness beyond that of software developed using traditional programming languages and testing practices. However, guarantees from formal verification are only as good as the underlying definitions and specification properties. If properties are incomplete, flaws in definitions may not be captured during verification, which can lead to unexpected system behavior and failures. Mutation analysis is a general technique for evaluating specifications for adequacy and completeness, based on making small-scale changes to systems and observing the results. We demonstrate mCoq, the first mutation analysis tool for Coq projects. mCoq changes Coq definitions, with each change producing a modified project version, called a mutant, whose proofs are exhaustively checked. If checking succeeds, i.e., the mutant is live, this may indicate specification incompleteness. Since proof checking can take a long time, we optimized mCoq to perform incremental and parallel processing of mutants. By applying mCoq to popular Coq libraries, we found several instances of incomplete and missing specifications manifested as live mutants. We believe mCoq can be useful to proof engineers and researchers for analyzing software verification projects. The demo video for mCoq can be viewed at: https://youtu.be/QhigpfQ7dNo.
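The mutant-checking loop can be sketched generically (a toy Python stand-in for mCoq's pipeline; the "proof check" here is just a predicate, and the parallelism is a plain thread pool rather than mCoq's actual scheduling): a mutant that passes every check is live and hints at an incomplete specification.

```python
# Generic sketch of parallel mutant checking, illustrating why a live
# mutant signals specification incompleteness.
from concurrent.futures import ThreadPoolExecutor

def find_live_mutants(spec, mutants, workers=4):
    """Check all mutants against the spec in parallel; return the live ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(spec, mutants))
    return [m for m, live in zip(mutants, results) if live]

# Incomplete spec: it only demands non-negativity at one point, so a
# clearly wrong mutant (x * 0) survives as a live mutant.
spec = lambda f: f(3) >= 0
mutants = [lambda x: x + 1, lambda x: x - 100, lambda x: x * 0]
live = find_live_mutants(spec, mutants)   # x - 100 is killed; the others live
```

Strengthening the spec (e.g., also requiring f(3) > 3) would kill the x * 0 mutant, mirroring how live mutants guide proof engineers toward missing specification properties.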
Message passing is the primary programming paradigm in high-performance computing. However, developing message passing programs is challenging due to the non-determinism caused by parallel execution and complex programming features such as non-deterministic communications and asynchrony. We present MPI-SV, a symbolic verifier for parallel C programs that use the Message Passing Interface (MPI). MPI-SV combines symbolic execution and model checking in a synergistic manner to improve scalability and enlarge the scope of verifiable properties. We have applied MPI-SV to real-world MPI C programs. The experimental results indicate that MPI-SV can, on average, achieve 19x speedups in verifying deadlock-freedom and 5x speedups in finding counter-examples. MPI-SV can be accessed at https://mpi-sv.github.io, and the demonstration video is at https://youtu.be/zzCY0CPDNCw.
To ensure interoperability and the correct behavior of heterogeneous distributed systems in key scenarios, it is important to conduct automated integration tests based on distributed test components (called local testers) that are deployed close to the system components to simulate inputs from the environment and monitor interactions with the environment and other system components. We say that a distributed test scenario is locally controllable and locally observable if test inputs can be decided locally and conformance errors can be detected locally by the local testers, without exchanging coordination messages between the test components during test execution (which may reduce the responsiveness and fault detection capability of the test harness). DCO Analyzer is the first tool that checks whether distributed test scenarios specified by means of UML sequence diagrams exhibit those properties, and it automatically determines a minimum number of coordination messages to enforce them.
The demo video for DCO Analyzer can be found at https://youtu.be/LVIusK36_bs.
Deep learning (DL) systems, though widely used, still suffer from quality and reliability issues. Researchers have put much effort into investigating these issues. One promising direction is to leverage uncertainty, an intrinsic characteristic of DL systems when making decisions, to better understand their erroneous behavior. DL system testing is an effective method to reveal potential defects before deployment into safety- and security-critical applications. Various techniques and criteria have been designed to generate defect triggers, i.e., adversarial examples (AEs). However, whether these test inputs achieve a full-spectrum examination of DL systems remains unknown, and the relation between AEs and DL uncertainty is still poorly understood. In this work, we first conduct an empirical study to uncover the characteristics of AEs from the perspective of uncertainty. Then, we propose a novel approach to generate inputs that are missed by existing techniques. Further, we investigate the usefulness and effectiveness of these data for DL robustness enhancement.
In this research, we investigate the effect of pair programming on the minds of software developers using EEG data and how it affects the overall outcome of their task. We use an EEG device to measure the brain-behavior relations of the developer and analyze the electromagnetic waves using ERD and correlation. We measure whether the concentration level is high or low under three different conditions: solo programming, pair programming (navigator), and pair programming (driver). The preliminary results confirm a higher concentration level during pair programming than during solo programming.
Due to the rapid development of deep neural networks, machine translation software has in recent years been widely adopted in people's daily lives, for example to communicate with foreigners or to understand political news from neighbouring countries. However, machine translation software can return incorrect translations because of the complexity of the underlying network. To address this problem, we introduce a novel methodology called PaInv for validating machine translation software. Our key insight is that sentences with different meanings should not have the same translation (i.e., pathological invariance). Specifically, PaInv generates syntactically similar but semantically different sentences by replacing one word in a sentence, and it filters out unsuitable sentences based on both syntactic and semantic information. We have applied PaInv to Google Translate using 200 English sentences as input with three language settings: English→Hindi, English→Chinese, and English→German. PaInv accurately found 331 pathological invariants in total, revealing more than 100 translation errors.
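The pathological-invariance check can be sketched with a stub translator (a toy illustration only; PaInv targets real systems such as Google Translate and uses far richer filtering): two source sentences with different meanings must not collapse to one translation.

```python
# Toy sketch of the pathological-invariance oracle. The translator below is
# a hypothetical lookup table with a deliberate bug: it drops the adjective.

def translate(sentence):
    table = {"a big dog": "un perro",
             "a small dog": "un perro",   # bug: same output as "a big dog"
             "a cat": "un gato"}
    return table[sentence]

def pathological_invariant(s1, s2):
    """True if two distinct-meaning inputs share one translation,
    which flags a likely translation error without a reference oracle."""
    return s1 != s2 and translate(s1) == translate(s2)

print(pathological_invariant("a big dog", "a small dog"))  # True: likely bug
print(pathological_invariant("a big dog", "a cat"))        # False
```

As with other metamorphic approaches, the check needs no ground-truth translation; it only compares the system's outputs against each other.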
Testing cyber-physical system (CPS) development tools such as MathWorks' Simulink is very important, as they are widely used in the design, simulation, and verification of CPS data-flow models. Existing randomized differential testing frameworks such as SLforge leverage semi-formal Simulink specifications to guide random model generation, which requires significant research and engineering investment, along with manual tool updates whenever MathWorks changes its model validity rules. To address these limitations, we propose to learn validity rules automatically with our framework DeepFuzzSL, which trains a language model on an existing corpus of Simulink models. In our experiments, DeepFuzzSL consistently generates over 90% valid Simulink models and has also found 2 bugs confirmed by MathWorks Support.
Modern, agile software development methods rely on iterative work and improvement cycles to deliver their claimed benefits. In Scrum, the most popular agile method, process improvement is implemented through regular Retrospective meetings. In these meetings, team members reflect on the latest development iteration and decide on improvement actions. To identify potential issues, data on the completed iteration needs to be gathered. The Scrum method itself does not prescribe these steps in detail. However, Retrospective games, i.e. interactive group activities, have been proposed to encourage the sharing of experiences and problems. These activities mostly rely on the collected perceptions of team members. However, modern software development practices produce a large variety of digital project artifacts, e.g. commits in version control systems or test run results, which contain detailed information on performed teamwork. We propose taking advantage of this information in new, data-driven Retrospective activities, allowing teams to gain additional insights based on their own team-specific data.
The popularity of Open Source Software (OSS) is at an all-time high, and for it to remain so it is vital that new developers continually join and contribute to the OSS community. In this paper, to better understand first-time contributors, we study the characteristics of the first pull request (PR) made to an OSS project by developers. We mine GitHub for the first OSS PR of 3,501 developers to study characteristics of PRs such as language and size. We find that over a third of the PRs were in Java, while C++ was very unpopular. A large fraction of PRs did not even involve writing code and were a mixture of trivial and non-trivial changes.
This paper introduces type-aware mutation, a simple, but effective methodology for stress testing Satisfiability Modulo Theories (SMT) solvers. The key idea is mutating the operators of the formula to generate test inputs for differential testing, while considering the types of the operators to ensure the mutants are still valid. The realization of type-aware mutation was evaluated on finding bugs in two state-of-the-art SMT solvers, Z3 and CVC4. During the three months of empirical evaluation, 101 unique, confirmed bugs were found by type-aware mutation, and 87 of them have been fixed. The testing efforts and bugs were well-appreciated by the developers.
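The central constraint of type-aware mutation can be sketched abstractly (a toy Python sketch, independent of Z3's or CVC4's actual APIs and of the paper's implementation): an operator may only be replaced by another operator with the same signature, so the mutated formula remains well-typed.

```python
# Toy sketch of type-aware operator mutation for SMT-style formulas.
import random

# Operators grouped by signature (argument sort, result sort); the grouping
# here is a small illustrative subset, not the full SMT-LIB operator set.
SIGNATURES = {
    ("Int", "Int"):  ["+", "-", "*"],               # Int x Int -> Int
    ("Int", "Bool"): ["<", "<=", ">", ">=", "="],   # Int x Int -> Bool
}

def type_aware_mutate(op, rng=random):
    """Replace `op` with a different operator of the same signature,
    guaranteeing the mutant is still a valid (well-typed) formula."""
    for _, ops in SIGNATURES.items():
        if op in ops:
            return rng.choice([o for o in ops if o != op])
    raise ValueError(f"unknown operator {op!r}")

# "(< x y)" may become "(>= x y)", but never "(+ x y)", which would be
# ill-typed; every mutant therefore remains a usable differential test input.
```

Because every mutant is well-typed, solvers cannot trivially reject it, and any disagreement between solvers on a mutant's satisfiability points to a real bug.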
The egocentric bias describes the tendency to value one's own input and perspective higher than that of others. This phenomenon impacts collaboration and teamwork. However, current research on the subject concerning modern software development is lacking. We conducted a case study of 26 final year software engineering students and collected the perceptions of individual contributions to team efforts through regular surveys. We report evidence of an egocentric bias in engineering team members, which decreased over time. In contrast, we found no in-group bias, i.e. favoritism regarding contributions of own team members. We discuss our initial analyses and results, which we hypothesize can be explained by group cohesiveness as well as non-competition and group similarity, respectively.
Developers write logging statements to generate logs and record system execution behaviors, assisting in debugging and software maintenance. However, there are no practical guidelines on where to write logging statements. On the one hand, adding too many logging statements may introduce superfluous, trivial logs and performance overhead. On the other hand, logging too little may miss necessary runtime information. Thus, properly deciding on logging locations is a challenging task, and a finer-grained understanding of where to write logging statements is needed to assist developers in making logging decisions. In this paper, we conduct a comprehensive study to uncover guidelines on logging locations at the code block level. We analyze logging statements and their surrounding code by combining deep learning techniques and manual investigation. Our preliminary results show that our deep learning models achieve over 90% precision and recall when trained using syntactic (e.g., nodes in the abstract syntax tree) and semantic (e.g., variable names) features. However, cross-system models trained using semantic features achieve only 45.6% precision and 73.2% recall, while models trained using syntactic features still achieve over 90% precision and recall. Our current progress highlights that there is an implicit syntactic logging guideline across systems, and such information may be leveraged to uncover general logging guidelines.
Dockerfiles play an important role in the Docker-based software development process, but in practice many Dockerfiles suffer from quality issues. Previous empirical studies have shown an association between code quality and project characteristics. However, the relationship between Dockerfile quality and project characteristics has never been explored. In this paper, we empirically study this relationship through a large dataset of 6,334 projects. Using linear regression analysis, while controlling for various variables, we statistically identify and quantify the relationship between Dockerfile quality and project characteristics.
Open source plays a critical role in our software infrastructure. It is used in the creation of almost every product and makes it increasingly easy to create powerful software cheaply and quickly, which many companies benefit from. However, its importance, and our dependence on it, are often not recognized. Like all software projects, open source needs maintenance to fix bugs and adapt code to evolving technologies. With increasing popularity, demands for maintenance and support work also rise, resulting in many requests and reported issues. How to supply all of the needed maintenance and development work is an open and sometimes controversial question.
Game testing is a necessary but challenging task for gaming platforms. Current game testing practice requires significant manual effort. In this paper, we propose an automated game testing framework combining an adversarial inverse reinforcement learning algorithm with evolutionary multi-objective optimization. The framework aims to help gaming platforms assure market-wide game quality, as it is suitable for testing different games with minimal manual customization for each game.
Software quality and reliability have proved to be important during program development. Many existing studies try to improve them through bug detection and automated program repair. However, each has its own limitations, and overall performance still leaves room for improvement. In this paper, we propose a deep learning framework to improve software quality and reliability in these two detect-fix processes. We use advanced code modeling and AI models to improve on state-of-the-art approaches. The evaluation results show that our approach achieves a relative improvement of up to 206% in F-1 score compared with baselines on bug detection, and a relative improvement of up to 19.8 times in the number of correct bug fixes compared with baselines on automated program repair. These results demonstrate that our framework performs strongly in improving software quality and reliability in the bug detection and automated program repair processes.
Web services often impose constraints that restrict the way in which two or more input parameters can be combined to form valid calls to the service, i.e., inter-parameter dependencies. Current web API specification languages like the OpenAPI Specification (OAS) provide no support for the formal description of such dependencies, making it hardly possible to interact with the services without human intervention. We propose specifying and automatically analyzing inter-parameter dependencies in web APIs. To this end, we propose a domain-specific language to describe these dependencies, a constraint programming-aided tool supporting their automated analysis, and an OAS extension integrating our approach and easing its adoption. Together, these contributions open a new range of possibilities in areas such as source code generation and testing.
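As a rough illustration of what such dependencies look like in practice (the parameter names below are invented, and this is a plain-Python check rather than the authors' DSL), two common kinds are a `Requires` dependency and mutual exclusion:

```python
# Minimal sketch of checking two common kinds of inter-parameter
# dependencies before issuing an API call. Parameter names are hypothetical.
def check_dependencies(params: dict) -> list[str]:
    violations = []
    # Requires: if 'radius' is set, 'location' must be set too.
    if "radius" in params and "location" not in params:
        violations.append("radius requires location")
    # OnlyOne: 'query' and 'ids' are mutually exclusive.
    if "query" in params and "ids" in params:
        violations.append("only one of query/ids may be set")
    return violations

print(check_dependencies({"radius": 10}))                         # Requires violated
print(check_dependencies({"query": "cafe", "ids": [1]}))          # OnlyOne violated
print(check_dependencies({"location": "41.38,2.17", "radius": 10}))  # valid call
```

Encoding such rules declaratively, rather than hand-coding them as above, is precisely what a DSL plus constraint-programming backend enables.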
Cyber-attacks stealing confidential information are becoming increasingly frequent and devastating as modern software systems store and manipulate greater amounts of sensitive data. Leaking information about private user data, such as the financial and medical records of individuals, trade secrets of companies, and military secrets of states, can have drastic consequences. Confidentiality of such private data is critical for users of these systems. Many software development practices, such as the encryption of packets sent over a network, aim to protect the confidentiality of private data by ensuring that an observer is unable to learn anything meaningful about a program's secret input from its public output. Under these protections, the software system's main communication channels, such as the content of the network packets it sends, or the output it writes to a public file, should not leak information about the private data. However, many software systems still contain serious security vulnerabilities. Side channels are an important class of information leaks where secret information can be captured through the observation of non-functional side effects of software systems. Potential side channels include those in execution time, memory usage, size and timing of network packets, and power consumption. Although side-channel vulnerabilities due to hardware (such as vulnerabilities that exploit cache behavior) have been extensively studied [1, 2, 10, 13, 15-17, 19, 23], software side channels have only recently become an active area of research, including recent results on software side-channel detection [4, 8, 11, 12, 18, 22, 24] and quantification [5, 20, 21], and my own work on a static analysis framework for the detection of software side channels, called CoCo-Channel, and a constraint caching framework to accelerate side-channel quantification, called Cashew.
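A textbook example of such a software side channel, independent of the cited tools, is a secret comparison that exits early: its running time reveals how many leading characters of a guess are correct. The sketch below contrasts it with a constant-time check from Python's standard library.

```python
import hmac

# Vulnerable: returns as soon as a mismatching character is found, so the
# running time leaks the length of the matching prefix of the guess.
def insecure_check(secret: str, guess: str) -> bool:
    if len(secret) != len(guess):
        return False
    for s, g in zip(secret, guess):
        if s != g:
            return False
    return True

# Safer: hmac.compare_digest takes time independent of where the inputs differ.
def constant_time_check(secret: str, guess: str) -> bool:
    return hmac.compare_digest(secret.encode(), guess.encode())

print(insecure_check("s3cret", "s3cret"), constant_time_check("s3cret", "guess!"))
# → True False
```

Both functions return the same answers; the vulnerability lies entirely in the non-functional timing behavior, which is why such leaks escape functional testing and motivate dedicated detection and quantification tools.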
Deep Learning (DL) based systems are used widely. Developers update code to fix bugs in these systems, but how such bug-fixes impact the robustness of DL-based systems has not been clear. Does fixing code increase robustness, or does it deteriorate the learning capability of DL-based systems? To answer these questions, we studied 321 Stack Overflow posts based on a published dataset. In this study, we built a classification scheme to analyze how bug-fixes changed the robustness of the DL model and found that most bug-fixes can increase robustness. We also found evidence of bug-fixes that decrease robustness. Our preliminary results suggest that 12.5% and 2.4% of the bug-fixes in Stack Overflow posts caused an increase and a decrease in the robustness of DL models, respectively.
A test smell, analogous to a code smell, is a poor design choice in the implementation of test code. Recently, the concept of test smells has attracted great interest from researchers and practitioners. Surveys show that developers are aware of test smells and their potential consequences for a software system. However, there is limited empirical evidence on how developers address test smells during software evolution. Thus, in this paper, we study 2 research questions: (RQ1) How do test smells evolve? (RQ2) What is the motivation for removing test smells? Our results show that Assertion Roulette, Conditional Test Logic, and Unknown Test have a high churn rate, and that feature addition and improvement motivate refactoring, but test smells persist, indicating sub-optimal practice. With our study, we hope to fill the gap between academia and industry by providing evidence of sub-optimal practice in the way developers address test smells, and how it may be detrimental to the software.
Virtualization and containerization have been two disruptive technologies in the past few years. Both technologies allow isolating applications with fewer resources and have impacted fields such as software testing. In testing, the execution of containerized/virtualized test suites has achieved great savings, but when the complexity increases or the cost of deployment rises, open challenges remain, such as the efficient execution of End to End (E2E) test suites. This paper proposes a research problem and a feasible solution that seeks to improve resource usage in E2E tests through smart resource identification and a proper organization of test execution, in order to achieve efficient and effective resource usage. The resources are characterized by a series of attributes that provide information about each resource and its usage during the E2E testing phase. The test cases are grouped and scheduled with the resources (e.g., parallelized in the same machine or executed in a fixed arrangement), achieving an efficient test suite execution and reducing its total cost/time.
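The grouping step can be sketched as follows; the test names and resource annotations below are invented for illustration and are not from the paper:

```python
from collections import defaultdict

# Hypothetical E2E test cases annotated with the resources they use.
tests = {
    "test_login":    {"browser", "auth-service"},
    "test_checkout": {"browser", "payment-service"},
    "test_payment":  {"payment-service"},
    "test_profile":  {"auth-service"},
}

def group_by_resource(tests):
    """Group tests by shared resource so each deployment can be reused."""
    groups = defaultdict(list)
    for name, resources in tests.items():
        for res in resources:
            groups[res].append(name)
    return dict(groups)

schedule = group_by_resource(tests)
# Tests sharing payment-service can run against one deployment of it.
print(sorted(schedule["payment-service"]))
```

A real scheduler would additionally weigh the resource attributes (deployment cost, startup time, exclusivity) when deciding which groups to parallelize on the same machine; this sketch shows only the grouping idea.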
Blockchain-based decentralized applications (DApps) are becoming more widely accepted because they run publicly on the blockchain and cannot be modified implicitly. However, the fact that only a few developers master both blockchain and front-end programming skills results in error-prone DApps, especially when smart contracts have undergone a migration. Existing techniques rarely pay attention to the automated migration of DApps. In this paper, we first summarize 6 migration categories and propose an approach to figure out where changes are and which categories they belong to. In addition, we design a function call graph structure to ensure the mapping relationship is accurate, and compare it with the differences between two versions of the ABI to offer revising suggestions. We have developed an automated tool implementing our approach on real-world DApps and acquired positive preliminary evaluation results, which illustrate its practical value for realizing automated DApp migration.
Software engineering in the industrial automation domain requires generic methods to keep development complexity at an acceptable level. However, various PLC vendors nowadays use different dialects of the standardized programming languages in their tools, which hinders re-usability and interoperability across platforms. Service-oriented approaches can serve to overcome interoperability issues. In distributed control systems, the functionality of an automation component can be offered to the other parties that constitute a production system via a standardized interface, easing the orchestration of the whole system. This paper proposes such a generic interface that hides away the low-level implementation details of a particular functionality and provides a common semantic model for the execution. Further, we show how using such an interface can help to support and automate the overall engineering process, combining the functionality of different components to fulfill a production task. The reference implementation of the proposed concept was used in an industrial demonstrator, which shows the benefits in system flexibility due to component interoperability and re-usability compared to traditional control approaches.
Program dependence is a fundamental concept in many software engineering tasks, yet traditional dependence analysis struggles to cope with common modern development practices such as multi-lingual implementations and the use of third-party libraries. While Observation-based Slicing (ORBS) solves these issues and produces an accurate slice, it has a scalability problem due to the need to build and execute the target program multiple times. We would like to propose a radical change of perspective: a useful dependence analysis needs to be scalable even if it approximates the dependency. Our goal is a scalable approximate program dependence analysis via estimating the likelihood of dependence. We claim that 1) using external information such as lexical analysis or a development history, 2) learning a dependence model from partial observations, and 3) merging static and observation-based approaches would support this proposition. We expect that our technique would introduce a new perspective on program dependence analysis based on the likelihood of dependence. It would also broaden the capability of dependence analysis towards large and complex software.
Problem: Developers are increasingly adopting security practices in software projects in response to cyber threats. Despite the additional effort required to perform these practices, current cost models either do not consider security as an input or have not been properly validated with empirical data. Hypothesis: Increasing degrees of application of security practices and security features, motivated by security risks, lead to growing levels of added software development effort. Such an effort increase can be quantified through a parametric model that takes as input the usage degrees of security practices and requirements and outputs the additional software development effort. Contributions: The accurate prediction of secure software development effort will support the provision of a proper amount of resources to projects. We also expect that the quantification of the security effort will contribute to advancing research on the cost-effectiveness of software security.
Empirical studies have shown that mobile applications that do not drain the battery usually get good ratings from users. To make mobile applications energy efficient, many studies have been published that present refactoring guidelines and tools to optimize the code. However, these guidelines cannot be generalized w.r.t. energy efficiency, as there is not enough energy-related data for every context. Existing energy enhancement tools/profilers are mostly prototypes applicable to only a small subset of energy-related problems. In addition, the existing guidelines and tools mostly address energy issues once they have already been introduced. My goal is to add to the existing energy-related data by evaluating the energy consumption of various code smell refactorings and third-party libraries used in Android development. Data from such evaluations could provide generalized contextual guidelines to be used during application development to prevent the introduction of energy-related problems. I also aim to develop a support tool for the Android Studio IDE that could give meaningful recommendations to developers during development to make application code more energy efficient.
Adaptive and Learning Agents (ALAs) bring computational intelligence to their cyber-physical host systems to adapt to novel situations encountered in their complex operational environment. They do so by learning from experience to improve their performance. RTCA DO-178C specifies a stringent certification process for airborne software, which presents several challenges when applied to an ALA with regard to functional completeness, functional correctness, testability, and adaptability. This research claims that it is possible to certify an Adaptive Learning Unmanned Aerial Vehicle (UAV) Agent designed as per a Cognitive Architecture under the current DO-178C certification process when leveraging a qualified tool (DO-330), Model-Based Development and Verification (DO-331), and Formal Methods (DO-333). The research consists of developing, as a case study, an ALA embedded in a UAV aimed at neutralizing rogue UAVs in the vicinity of civil airports, and testing it in the field. This article is the plan to complete, by the end of 2022, a dissertation currently in its confirmation phase.
Software application programming interfaces (APIs) are a ubiquitous part of Software Engineering. The evolution of these APIs requires constant effort from their developers and users alike. API developers must constantly balance keeping their products modern whilst keeping them as stable as possible. Meanwhile, API users must continually be on the lookout to adapt to changes that could break their applications. As APIs become more numerous, users are challenged by a myriad of choices and information on which API to use. Current research attempts to provide automatic documentation, code examples, and code completion to make API evolution more scalable for users. Our work will attempt to establish practical and scalable API evolution guidelines and tools based on public code repositories, to aid both API users and API developers.
This thesis focuses on investigating the use of public code repositories provided by the open-source community to improve software API engineering practices. More specifically, I seek to improve software engineering practices linked to API evolution, both from the perspective of API users and API developers. To achieve this goal, I will apply quantitative and qualitative research methods to understand the problems at hand. I will then mine public code repositories to develop novel solutions to these problems.
Refactoring tools automate tedious and error-prone source code changes. The prevalence and difficulty of refactorings in software development make this a high-impact area for successful automation of manual operations. Automated refactoring tools can improve the speed and accuracy of software development and are easily accessible in many programming environments. Even so, developers frequently eschew automation in favor of manual refactoring, citing reasons such as lack of support for real usage scenarios and unpredictable tools. In this paper, we propose to redesign refactoring operations into transformations that are useful and applicable in real software evolution scenarios, with the help of repository mining and user studies.
Technical debt (TD), its impact on development, and its consequences such as defects and vulnerabilities are of common interest and great importance to software researchers and practitioners. Although there exist many studies investigating TD, the majority of them focus on identifying and detecting TD at a single stage of development. There are also studies that analyze vulnerabilities, focusing on some phases of the life cycle. Moreover, several approaches have investigated the relationship between TD and vulnerabilities; however, the generalizability and validity of their findings are limited due to small datasets. In this study, we aim to identify TD through multiple phases of development, and to automatically measure it through data and text mining techniques to form a comprehensive feature model. We plan to utilize neural network based classifiers that will incorporate evolutionary changes in TD measures into predicting vulnerabilities. Our approach will be empirically assessed on open source and industrial projects.
Problem: The goal of a software product line is to aid quick and quality delivery of software products sharing common features. Effectively achieving these goals requires reuse analysis of the product line features. Existing requirements reuse analysis approaches are not focused on recommending product line features that can be reused to realize new customer requirements. Hypothesis: Given that customer requirements are linked to the descriptions of the product line features satisfying them, the customer requirements can be clustered based on patterns and similarities, preserving the historic reuse information. New customer requirements can then be evaluated against existing customer requirements, and reuse of product line features can be recommended. Contributions: We treat the problem of feature reuse analysis as a text classification problem at the requirements level. We use Natural Language Processing and clustering to recommend reuse of features based on similarities and historic reuse information. The recommendations can be used to realize new customer requirements.
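The similarity step of the hypothesis can be sketched with a simple bag-of-words cosine measure; the requirement texts and IDs below are invented, and the actual approach uses richer NLP than whitespace tokenization:

```python
from collections import Counter
import math

# Toy sketch: compare a new customer requirement against historic ones
# using bag-of-words cosine similarity.
def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

historic = {
    "REQ-1": "the vehicle shall support remote door unlocking",
    "REQ-2": "the system shall log all payment transactions",
}
new_req = "support remote unlocking of the vehicle door"
best = max(historic, key=lambda r: cosine(historic[r], new_req))
print(best)  # the historic requirement whose linked feature could be reused
```

In the proposed approach, the feature linked to the best-matching historic requirement (here `REQ-1`) would be recommended for reuse when realizing the new requirement.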
It is well known that it is desirable to capture the most essential parts of software design meetings that take place at the whiteboard. It is equally well known, however, that actual capture rarely takes place. A few photos may be taken, informal notes might be scribbled down, and at best one of the developers may be tasked with creating a summary. Regardless, problems persist with important information being lost and, even when information is captured, that information not being easily located and accessed. To address these problems, I propose to design and evaluate a novel suite of tools that enables software designers working at the whiteboard to: (1) efficiently and in-the-moment capture important information produced during that meeting, and (2) be delivered, either by request or proactively by the tools, relevant information captured in the past when it is needed in a future design meeting.
Developers write logging statements to generate logs and record system execution behaviors. Such logs are widely used for a variety of tasks, such as debugging, testing, program comprehension, and performance analysis. However, there exist no practical guidelines on how to write logging statements, which makes the logging decision a very challenging task. Developers face two main challenges when making logging decisions: 1) it is difficult to accurately and succinctly record execution behaviors; and 2) it is hard to decide where to write logging statements. This thesis proposes a series of approaches to address these problems and help developers make logging decisions in two aspects: deciding on logging contents and on logging locations. Through case studies on large-scale open source and commercial systems, we anticipate that our study will provide useful suggestions and support to developers for writing better logging statements.
Testing of web APIs is nowadays more critical than ever before, as they are the current standard for software integration. A bug in an organization's web API could have a huge impact both internally (services relying on that API) and externally (third-party applications and end users). Most existing tools and testing approaches require writing tests or instrumenting the system under test (SUT). The main aim of this dissertation is to take web API testing to an unprecedented level of automation and thoroughness. To this end, we plan to apply artificial intelligence (AI) techniques for the autonomous detection of software failures. Specifically, the idea is to develop intelligent programs (we call them "bots") capable of generating hundreds, thousands or even millions of test inputs and to evaluate whether the test outputs are correct based on: 1) patterns learned from previous executions of the SUT; and 2) knowledge gained from analyzing thousands of similar programs. Evaluation results of our initial prototype are promising, with bugs being automatically detected in some real-world APIs.
Performance is an important aspect of software quality. Performance goals are typically defined by setting upper and lower bounds for the response time and throughput of a system and for physical-level measurements such as CPU, memory, and I/O. To meet such performance goals, several performance-related activities are needed in development (Dev) and operations (Ops). In fact, large software system failures are often due to performance issues rather than functional bugs. One of the most important performance issues is performance regression. Although not all performance regressions are bugs, they often have a direct impact on users' experience of the system. The process of detecting performance regressions in development and operations faces challenges. First, the detection of performance regressions is conducted after the fact, i.e., after the system is built and deployed in the field or in dedicated performance testing environments. Large amounts of resources are required to detect, locate, understand, and fix performance regressions at such a late stage in the development cycle. Second, even if we can detect a performance regression, it is extremely hard to fix because other changes are applied to the system after the introduction of the regression. These challenges call for further in-depth analyses of performance regressions. In this dissertation, to avoid performance regressions slipping into operations, we first perform an exploratory study on the source code changes that introduce performance regressions, in order to understand their root causes at the source code level. Second, we propose an approach that automatically predicts whether a test would manifest performance regressions in a code commit. To assist practitioners in analyzing system performance with operational data, we propose an approach to recovering a field-representative workload that can be used to detect performance regressions. We also propose using execution logs generated by unit tests to predict performance regressions in load tests.
While there is not much discussion on the importance of formally describing and analyzing quantitative requirements in the process of software construction, in the paradigm of API-based software systems it could be vital. Quantitative attributes can be thought of as attributes determining the Quality of Service (QoS) provided by a software component published as a service. In this sense, they play a determinant role in classifying software artifacts according to specific needs stated as requirements.
In this work, we present a research program consisting of the development of formal languages and tools to characterize and analyze the Quality of Service attributes of software components in the context of distributed systems. More specifically, our main motivational scenario lies in the execution of a service-oriented architecture.
Identifying the source of a program failure plays an integral role in maintaining software quality. Both fault localisation and defect prediction aim to locate faults: fault localisation aims to locate faults after they are revealed, while defect prediction aims to locate yet-to-happen faults. Despite sharing a similar goal, fault localisation and defect prediction have been studied as separate topics, mainly due to the difference in the data available to exploit. In our doctoral research, we aim to bridge fault localisation and defect prediction. Our work is divided into three parts: 1) applying defect prediction to fault localisation, i.e., DP2FL, 2) applying fault localisation to defect prediction, i.e., FL2DP, and 3) consecutive application of DP2FL and FL2DP in a single framework. We expect the synergy between fault localisation and defect prediction not only to improve the accuracy of each process but also to allow us to build a single model that gradually improves the overall software quality throughout the entire software development life-cycle.
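As one concrete bridge between the two fields, spectrum-based fault localisation turns test coverage into suspiciousness scores. The Ochiai formula below is a standard technique in the fault localisation literature, while the coverage matrix is invented for illustration:

```python
import math

# Ochiai suspiciousness: ef/ep = failing/passing tests covering the element,
# nf = failing tests NOT covering it. Higher score = more suspicious.
def ochiai(ef: int, ep: int, nf: int) -> float:
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Invented coverage data: element -> (ef, ep, nf).
coverage = {
    "line 10": (2, 0, 0),  # covered by all failing tests, no passing ones
    "line 11": (1, 3, 1),
    "line 12": (0, 4, 2),  # never covered by a failing test
}
scores = {elem: ochiai(*counts) for elem, counts in coverage.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # → line 10, the most suspicious element
```

Defect prediction works from a different signal (historic change and bug data), which is precisely why combining the two, as DP2FL and FL2DP propose, could enrich what each exploits alone.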
Software testing prevents and detects the introduction of faults and bugs during the process of evolving and delivering reliable software. As an important software development activity, testing has been intensively studied to measure test code quality and effectiveness, and to assist professional developers and testers with automated test generation tools. In recent years, testing has been attracting educators' attention and has been integrated into some Computer Science education programs. Understanding the challenges and problems faced by students can help inform educators of the topics that require extra attention and practice when presenting testing concepts and techniques.
In my research, I study how students implement and modify source code given unit tests, and how they perceive and perform unit testing. I propose to quantitatively measure the quality of student-written test code, and to qualitatively identify the common mistakes and bad smells observed in student-written test code. We compare the performance of students and professionals, who vary in prior testing experience, to investigate the factors that lead to high-quality test code. The ultimate goal of my research is to address the challenges students encounter during test code composition and to improve their testing skills with supportive tools or guidance.
The history of software teaches us that great software projects are the ones that manage to sustain their quality. Free and open source software (FOSS) has become a serious software supply channel. However, trust in FOSS products is still an issue, and quality is a trait that enhances trust. In my study, I investigate the following question: how do FOSS communities sustain their software quality? I argue that human and social factors contribute to the sustainability of quality in FOSS communities. Amongst these factors are: the motivation of participants, a robust governance style for the software change process, and the exercise of good practices in the pull request evaluation process.
In modern software engineering, developers have to work with constantly evolving, interconnected software systems. Understanding how and why these systems and the dependencies between them change is therefore an essential step in improving or maintaining them. For this, it is important to know what changed and how those changes influence the system. Most currently used tools that help developers understand source code changes either use the textual representation of source code, allowing for a coarse-grained overview, or use the AST (abstract syntax tree) representation of source code to extract more fine-grained changes. We plan to improve the accuracy and classification of the extracted source code changes and to extend them by analysing fine-grained changes in source code dependencies. We also propose a dynamic analysis of the impact of the previously extracted changes on performance metrics. This helps to understand which changes caused a certain change in program behaviour. We plan to use and combine this information to generate accurate and detailed change overviews that bridge the gap between existing coarse-grained solutions and the raw changes contained in the code, aiming to reduce the time developers spend reading changed code and to help them quickly understand the changes between two versions of source code.
Despite their growing popularity, apps tend to contain defects which can ultimately manifest as failures (or crashes) to end-users. Different automated tools for testing Android apps have been proposed in order to improve software quality. Although Genetic Algorithms and Evolutionary Algorithms (EAs) have been promising in recent years, recent results suggest they are not yet fully tailored to the problem of Android test generation. Thus, this thesis aims to design and evaluate algorithms for alleviating the burden of testing Android apps. In particular, I plan to investigate which search-based algorithm is best suited for this particular problem. As the thesis advances, I expect to develop a fully open-source test case generator for Android applications that will serve as a framework for comparing different algorithms. These algorithms will be compared using statistical analysis on both open-source (i.e., from F-Droid) and commercial applications (i.e., from the Google Play Store).
Software developers are increasingly having conversations about software development via online chat services. Many of those chat communications contain valuable information, such as code descriptions, good programming practices, and causes of common errors/exceptions. However, the nature of chat community content is transient, as opposed to the archival nature of other developer communications such as email, bug reports and Q&A forums. As a result, important information and advice are lost over time.
The focus of this dissertation is Extracting Archival Information from Software-Related Chats, specifically to (1) automatically identify conversations that contain archival-quality information, (2) accurately reduce the granularity of the information reported as archival information, and (3) conduct a case study to investigate how archival-quality information extracted from chats compares to related posts in Q&A forums. Archiving knowledge from developer chats could potentially be used in several applications, such as: creating a new archival mechanism available to a given chat community, augmenting Q&A forums, or facilitating the mining of specific information and improving software maintenance tools.
Context: Software has become ubiquitous in every corner of modern societies. During the last five decades, software engineering has also changed significantly to advance the development of various types and scales of software products. In this context, software engineering education plays an essential role in keeping students updated with software technologies, processes, and practices that are popular in industry. Aim: In this PhD work, I want to answer the following research questions: To what extent are software engineering trends present in software engineering education? In what ways can the characteristics of growth-phase software startups be transferred into a software engineering education context? What is the impact of software startup engineering on the curriculum and on software engineering students? Method: I utilize literature review and mixed-methods approaches (quantitative and qualitative data and methods triangulation) to gather empirical evidence. More precisely, I split my research method into two phases. The first phase acquires knowledge and insight based on the existing literature. The second phase splits the focus in two directions. Firstly, I shall gather empirical evidence on how software startup practices are present in software engineering education. Secondly, I will conduct parallel investigations into SE practices in growth-phase software startups. Expected Results: I argue that software startup engineering practices are a powerful tool for software engineering education approaches. I expect students to acquire software engineering skills in a more realistic context while using growth-phase software startup practices.
Technical debt (TD) is an economic metaphor used to describe suboptimal choices made in the software development process. It usually occurs when developers take shortcuts instead of following agreed-upon development practices, and unchecked growth of technical debt can harm software development processes.
Technical debt detection and management is mainly done manually, which is both a slow and a costly way of detecting technical debt. Automatic detection would solve this issue, but even today's state-of-the-art tools do not accurately detect the appearance of technical debt. Therefore, increasing the accuracy of automatic classification is of high importance, so that a significant portion of the costs of technical debt detection can be eliminated.
This research aims to solve the detection accuracy problem by bringing together static code analysis and natural language processing. This combination of techniques will allow more accurate detection of technical debt than either technique used alone. The research also aims to discover themes and topics in written developer messages that can be linked to technical debt; these can help us understand technical debt from the developers' viewpoint. Finally, we will build an open-source tool/plugin that can be used to accurately detect technical debt using both static analysis and natural language processing methods.
Data modeling in Cassandra databases follows a query-driven approach where each table is created to satisfy a query, leading to repeated data, as the Cassandra model is not normalized by design. Consequently, developers bear the responsibility of maintaining data integrity at the application level, as opposed to when the model is normalized. This is done by embedding in the client application the appropriate statements to perform data changes, which is error-prone. Cassandra data modeling methodologies have emerged to cope with this problem by proposing the use of a conceptual model to generate the logical model, solving the data modeling problem but not the data integrity one. In this thesis, we address the problem of the integrity of these data by proposing a method that, given a data change at either the conceptual or the logical level, determines the executable statements that should be issued to preserve data integrity. Additionally, as this integrity may also be lost as a consequence of creating new data structures in the logical model, we complement our method to preserve data integrity in these scenarios. Furthermore, we address the creation of data structures at the conceptual level to represent a normalized version of newly created data structures in the logical model.
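To make the integrity burden concrete, here is a minimal, hypothetical sketch (in-memory Python dictionaries standing in for two denormalized Cassandra tables; the table and column names are invented, and the thesis's actual method generates CQL statements rather than function calls). It shows how one logical change fans out into one write per table that duplicates the entity:

```python
# Two denormalized "tables" answering different queries over the same videos
# data -- a hypothetical example, not taken from the thesis.
videos_by_author = {}  # (author, video_id) -> title
videos_by_tag = {}     # (tag, video_id) -> (author, title)

def insert_video(video_id, author, title, tags):
    # The client must write to every table that duplicates the entity,
    # otherwise the copies diverge -- the integrity burden the thesis
    # aims to automate away.
    videos_by_author[(author, video_id)] = title
    for tag in tags:
        videos_by_tag[(tag, video_id)] = (author, title)

def rename_video(video_id, author, tags, new_title):
    # A single logical change fans out into one statement per table.
    videos_by_author[(author, video_id)] = new_title
    for tag in tags:
        videos_by_tag[(tag, video_id)] = (author, new_title)

insert_video("v1", "ada", "Intro", ["cql", "nosql"])
rename_video("v1", "ada", ["cql", "nosql"], "Intro to CQL")
```

Forgetting the loop over `videos_by_tag` in `rename_video` would silently leave stale copies behind, which is exactly the class of error the proposed method prevents by deriving the full statement set from a data change.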
Many developers do not understand how to develop accessible software, or do not recognize the need to. To address this, we have created five educational Accessibility Learning Labs (ALL) using an experiential learning structure. Each lab addresses a foundational concept in computing accessibility, informing participants about creating accessible software while also demonstrating its necessity. The hosted labs provide a complete educational experience, containing materials such as lecture slides, activities, and quizzes.
We evaluated the labs in ten sections of a CS2 course at our university, with 276 students participating. Our primary findings include: I) The labs are an effective way to inform participants about foundational topics in creating accessible software; II) The labs demonstrate the potential benefits of our proposed experiential learning format in motivating participants about the importance of creating accessible software; III) The labs demonstrate that empathy material increases learning retention. The created labs and project materials are publicly available on the project website: http://all.rit.edu
We present Precfix, a pragmatic approach targeting large-scale industrial codebases and making recommendations based on previously observed debugging activities. Precfix collects defect-patch pairs from development histories, performs clustering, and extracts generic reusable patching patterns as recommendations. Our approach is able to make recommendations within milliseconds and achieves a false positive rate of 22%. Precfix has been rolled out at Alibaba to support various critical businesses.
Unfair work distribution is common in project-based learning with teams of students. One contributing factor is that students are differently skilled developers. To mitigate the differences in a course with group work, we introduced mandatory programming lab sessions. The intervention did not affect the work distribution, showing that more is needed to balance the workload. Contrary to our goal, the intervention was very well received among experienced students, but unpopular with students weak at programming.
Data analytics application development introduces many challenges, including: new roles not found in traditional software engineering practices, e.g., data scientists and data engineers; the use of sophisticated machine learning (ML) model-based approaches; uncertainty inherent in the models; interfacing with models to fulfill software functionalities; deploying models at scale; and the rapid evolution of business goals and data sources. We describe our Big Data Analytics Modeling Languages (BiDaML) toolset, which brings all stakeholders around one tool to specify, model, and document big data applications. We report on our experience applying BiDaML to three real-world, large-scale applications. Our approach successfully supports complex data analytics application development in industrial settings.
Due to the growing value of software technology in our everyday life, young professionals and undergraduates need to be well qualified for Software Engineering (SE) careers. Additionally, the didactic basis of SE education is a recent development.
DevOps stands for Development-Operations. It arose in the IT industry as a movement aligning development and operations teams. DevOps is broadly recognized as an IT standard, and there is high demand for DevOps practitioners in industry. Therefore, we studied whether undergraduates acquire adequate DevOps skills to fulfill this demand. We employed Grounded Theory (GT), a social science qualitative research methodology, to study DevOps education from academic and industrial perspectives. In academia, academics were not motivated to learn or adopt DevOps, and we did not find strong evidence of academics teaching DevOps. Academics need incentives to adopt DevOps in order to stimulate interest in teaching it. In industry, DevOps practitioners lack clearly defined roles and responsibilities, because the DevOps field is diverse and growing very fast. Therefore, practitioners can only learn DevOps through hands-on working experience. As a result, academic institutions should provide fundamental DevOps education (in culture, procedure, and technology) to prepare students for their future DevOps advancement in industry. Based on our findings, we propose five groups of future studies to advance DevOps education in academia.
Alerts are a key data source in the monitoring systems of online service systems: they record anomalies in service components and report them to engineers. In general, the occurrence of a service failure tends to be accompanied by a large number of alerts, called an alert storm. Alert storms make failure diagnosis challenging, since it is time-consuming and tedious for engineers to investigate such an overwhelming number of alerts manually. To help understand alert storms, we conduct the first empirical study of them based on large-scale real-world alert data and gain some valuable insights. Based on the findings, we propose a novel approach to handling alert storms. Specifically, the approach comprises alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (above 0.9). Moreover, our alert storm summary reduces the number of alerts that need to be examined by more than 98% and discovers useful alerts accurately.
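The two-stage idea can be sketched with a deliberately simplified stand-in: a fixed-threshold volume rule for detection and template-based deduplication for the summary. The paper's actual detector and summarizer are more sophisticated; the window size, threshold, and alert templates below are illustrative assumptions only.

```python
from collections import Counter

def detect_storms(timestamps, window=60, threshold=50):
    """Flag time windows whose alert volume exceeds a threshold
    (a crude stand-in for the paper's alert storm detection)."""
    buckets = Counter(t // window for t in timestamps)
    return sorted(b * window for b, n in buckets.items() if n >= threshold)

def summarize(alerts):
    """Keep one representative alert per template, in first-seen order
    (a crude stand-in for the paper's alert storm summary)."""
    seen, representatives = set(), []
    for template, text in alerts:
        if template not in seen:
            seen.add(template)
            representatives.append(text)
    return representatives
```

Even this naive summarizer shows how a storm of near-duplicate alerts collapses to a handful of representatives for engineers to inspect.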
A diverse workforce is not just "nice to have"; it is a reflection of a changing world. Such a workforce brings high value to organizations and is essential for national technological innovation, economic vitality, and global competitiveness. Despite the importance of diversity in the broad field of computing, there is a comparatively low representation not only of women but also of other underrepresented minorities, such as indigenous people. To gain insights into their career choices, we conducted 10 interviews with Andean indigenous people. The findings reveal that seven factors (social support, exposure to digital technology, autonomy of use, purpose of use, digital skill, identity, and work ethic) help explain how and why indigenous people choose a career related to Software Engineering. This exploratory study also contributes to challenging common stereotypes and perceptions of indigenous people as low-qualified workers, academically untalented, and unmotivated.
Data scientists frequently analyze data by writing scripts. We conducted a contextual inquiry with interdisciplinary researchers, which revealed that parameter tuning is a highly iterative process and that debugging is time-consuming. As analysis scripts evolve and become more complex, analysts have difficulty conceptualizing their workflow. In particular, after editing a script, it becomes difficult to determine precisely which code blocks depend on the edit. Consequently, scientists frequently re-run entire scripts instead of re-running only the necessary parts. We present ProvBuild, a data analysis environment that uses change impact analysis to improve the iterative debugging process in script-based workflow pipelines. ProvBuild is a tool that leverages language-level provenance to streamline the debugging process by reducing programmer cognitive load and decreasing subsequent runtimes, leading to an overall reduction in elapsed debugging time. ProvBuild uses provenance to track dependencies in a script. When an analyst debugs a script, ProvBuild generates a simplified script that contains only the information necessary to debug a particular problem. We demonstrate that debugging the simplified script lowers a programmer's cognitive load and permits faster re-execution when testing changes. The combination of reduced cognitive load and shorter runtime reduces the time necessary to debug a script. We quantitatively and qualitatively show that even though ProvBuild introduces overhead during a script's first execution, it is a more efficient way for users to debug and tune complex workflows. ProvBuild demonstrates a novel use of language-level provenance, in which it is used to proactively improve programmer productivity rather than merely providing a way to retroactively gain insight into a body of code.
To the best of our knowledge, ProvBuild is a novel application of change impact analysis and it is the first debugging tool to leverage language-level provenance to reduce cognitive load and execution time.
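The core of change impact analysis can be illustrated with a small dependency-graph sketch: given which statements each statement reads from, compute everything transitively downstream of an edit, since only that part must re-run. This is a generic, hypothetical illustration; ProvBuild itself derives these dependencies automatically from language-level provenance rather than from a hand-written map.

```python
def affected(deps, changed):
    """deps maps each statement name to the statements whose results it
    reads. Return all statements transitively downstream of `changed` --
    the only part of the script that needs re-execution."""
    # Invert the map: producer -> set of consumers.
    consumers = {}
    for stmt, reads in deps.items():
        for producer in reads:
            consumers.setdefault(producer, set()).add(stmt)
    # Depth-first traversal of the consumer graph.
    downstream, stack = set(), [changed]
    while stack:
        current = stack.pop()
        for c in consumers.get(current, ()):
            if c not in downstream:
                downstream.add(c)
                stack.append(c)
    return downstream
```

For a pipeline load → clean → fit → plot, with a separate report step reading only the loaded data, editing the cleaning step requires re-running only fit and plot, never load or report.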
The notion of forking has changed with the rise of distributed version control systems and social coding environments like GitHub. Traditionally, forking refers to splitting off an independent development branch (which we call a hard fork); research on hard forks, conducted mostly in pre-GitHub days, showed that they were often seen as critical, since they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call a social fork), which may also have influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among other things, hard forks often evolve out of social forks rather than being planned deliberately, and that perceptions of hard forks have indeed changed dramatically: they are now often seen as a positive, non-competitive alternative to the original project.
With the increasing deployment of enterprise-scale distributed systems, effective and practical defenses for such systems against security vulnerabilities such as sensitive data leaks are urgently needed. However, most existing solutions are limited to centralized programs. For real-world distributed systems, which operate at large scale, current solutions commonly face one or more of scalability, applicability, and portability challenges. To overcome these challenges, we develop a novel dynamic taint analysis for enterprise-scale distributed systems. To achieve scalability, we use a multi-phase analysis strategy to reduce the overall cost. To address the applicability challenge, we infer implicit dependencies via partial ordering of method events in distributed programs. To achieve greater portability, the analysis is designed to work at the application level without customizing platforms. Empirical results show promising scalability and capabilities of our approach.
We propose a methodology to study and visualize the evolution of the modular structure of a network of functional dependencies in a software system. Our method identifies periods of significant refactoring activity, also known as evolutionary hot spots, in software systems. Our approach is based on clustering design structure matrices of functional dependencies and on Kleinberg's method of identifying evolutionary hot spots in dynamic networks. As a case study, we characterize the evolution of the modular structure of Octave over its entire life cycle.
Reentrancy bugs in smart contracts caused a devastating financial loss in 2016 and are considered among the most severe vulnerabilities in smart contracts. Most existing general-purpose security tools for smart contracts claim to be able to detect reentrancy bugs. In this paper, we present Clairvoyance, a cross-function and cross-contract static analysis that detects reentrancy vulnerabilities in smart contracts by identifying infeasible paths. To reduce false positives, we have summarized five major path protective techniques (PPTs) to support fast yet precise path feasibility checking. We have implemented our approach and compared Clairvoyance with three state-of-the-art tools on 17,770 real-world contracts. The results show that Clairvoyance yields the best detection accuracy among all the tools.
The computing education community has shown a long-standing interest in how to analyze the Object-Oriented (OO) source code developed by students to provide them with useful formative tips. In this paper, we propose and evaluate an approach to analyze how students use Java and its language constructs. The approach is implemented through a cloud-based integrated development environment (IDE) and is based on the analysis of the most common violations of the OO paradigm in student source code. Moreover, the IDE supports the automatic generation of reports about students' mistakes and misconceptions that can be used by instructors to improve the course design. The paper discusses the preliminary results of an experiment performed in a class of a Programming II course to investigate the effects of the provided reports in terms of coding ability (concerning the correctness of the produced code).
Many automated test generation tools have been proposed for finding bugs in Android apps. However, a recent study revealed that developers prefer reading automatically generated test cases written in natural language. We present Bugine, a new bug recommendation system that automatically selects relevant bug reports from other applications that have similar bugs. Bugine (1) searches for GitHub issues that mention common UI components shared between the app under test and the apps in our database, and (2) ranks the quality and relevance of these issues. Our results show that Bugine could find 34 new bugs in five evaluated apps.
With the prosperity of Android, developers need to deal with compatibility issues among different devices, which is costly. In this paper, we propose an automated and general approach named ICARUS to identify compatibility-related APIs in Android apps. The insight behind our approach is that compatibility-related APIs have a biased distribution among code segments, similar to the distribution of keywords among documents. This motivates us to leverage statistical features to discriminate compatibility-related APIs from normal APIs. Experimental results on real apps demonstrate the effectiveness of our approach.
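The keywords-among-documents analogy suggests IDF-style features: an API concentrated in few code segments scores high, while one spread evenly scores low. The sketch below is a hypothetical stand-in for this intuition (ICARUS's actual statistical features are not specified in the abstract, and the API names used are invented):

```python
import math

def concentration_scores(segments):
    """segments: list of sets of API names appearing in each code segment.
    Returns an IDF-style score per API: APIs occurring in few segments
    (a biased distribution) score high; ubiquitous APIs score near zero."""
    n = len(segments)
    doc_freq = {}
    for seg in segments:
        for api in seg:
            doc_freq[api] = doc_freq.get(api, 0) + 1
    return {api: math.log(n / d) for api, d in doc_freq.items()}

# Hypothetical example: "Log.d" appears everywhere, while a device-specific
# call appears in only one compatibility-handling segment.
scores = concentration_scores([
    {"Log.d", "getNotchHeight"},
    {"Log.d"},
    {"Log.d", "setText"},
])
```

Under this scoring, `getNotchHeight` (one segment out of three) scores log(3) while the ubiquitous `Log.d` scores zero, matching the biased-distribution insight.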
Even though Lean principles have already been broadly applied to the manufacturing industry, we cannot say the same regarding software development. The objective of this article is therefore to present a real experience where the Lean Kanban method was applied by a software development team from an IT consulting firm. The team (7 people) is responsible for the maintenance of internal management applications at a large governmental organization (over 4,000 employees). It had to combine new evolutionary developments with corrective maintenance and incident resolution within the production area of 20 to 25 information systems with heterogeneous purposes and technologies.
Developers are known to keep third-party dependencies of their projects outdated even if some of them are affected by known vulnerabilities. In this study we aim to understand why they do so. For this, we conducted 25 semi-structured interviews with developers of both large and small-medium enterprises located in nine countries. All interviews were transcribed, coded, and analyzed according to applied thematic analysis. The results of the study reveal important aspects of developers' practices that should be considered by security researchers and dependency tool developers to improve the security of the dependency management process.
Just because software developers say they believe in "X", that does not necessarily mean that "X" is true. As shown here, there exist numerous beliefs listed in the recent Software Engineering literature which are only supported by small portions of the available data. Hence we ask: what is the source of this disconnect between beliefs and evidence?
To answer this question we look for evidence for ten beliefs within 300,000+ changes seen in dozens of open-source projects. Some of those beliefs had strong support across all the projects; specifically, "A commit that involves more added and removed lines is more bug-prone" and "Files with fewer lines contributed by their owners (who contribute most changes) are bug-prone".
Most of the widely-held beliefs studied are only sporadically supported in the data; i.e. large effects can appear in project data and then disappear in subsequent releases. Such sporadic support explains why developers believe things that were relevant to their prior work, but not necessarily their current work.
Jupyter notebooks---documents that contain live code, equations, visualizations, and narrative text---are now among the most popular means to compute, present, discuss, and disseminate scientific findings. In principle, Jupyter notebooks should make it easy to reproduce and extend scientific computations and their findings; in practice, however, this is not the case. The individual code cells in Jupyter notebooks can be executed in any order, with identifier usages preceding their definitions and results preceding their computations. In a sample of 936 published notebooks that would be executable in principle, we found that 73% of them would not be reproducible with straightforward approaches, requiring humans to infer (and often guess) the order in which the authors created the cells.
In this paper, we present an approach to (1) automatically satisfy dependencies between code cells to reconstruct possible execution orders of the cells; and (2) instrument code cells to mitigate the impact of non-reproducible statements (e.g., random functions) in Jupyter notebooks. Our Osiris prototype takes a notebook as input and outputs the possible execution schemes that reproduce the exact notebook results. In our sample, Osiris was able to reconstruct such schemes for 82.23% of all executable notebooks, which is more than three times better than the state of the art; the resulting reordered code is valid program code and thus available for further testing and analysis.
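The dependency-satisfaction idea can be sketched as a greedy schedule over def-use information: extract the names each cell defines and reads, then repeatedly execute any cell whose read names are already available. This is a simplified illustration only; Osiris additionally handles side effects, multiple valid schedules, and non-determinism.

```python
import ast
import builtins

BUILTINS = set(dir(builtins))

def cell_names(code):
    """Names a cell defines (Store context) and reads (other contexts)."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return defined, used

def reconstruct_order(cells):
    """Greedily pick any remaining cell whose non-builtin reads are
    already defined; raise if no executable order exists."""
    info = [cell_names(c) for c in cells]
    available, order = set(), []
    remaining = list(range(len(cells)))
    while remaining:
        ready = next((i for i in remaining
                      if info[i][1] - info[i][0] - BUILTINS <= available),
                     None)
        if ready is None:
            raise ValueError("no executable order found")
        order.append(ready)
        remaining.remove(ready)
        available |= info[ready][0]
    return order
```

Given cells written out of order---`y = x + 1` before `x = 2`---the scheduler recovers an execution order in which every name is defined before use.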
Requirements Engineering (RE) involves critical activities to ensure the accurate elicitation and documentation of clients' requirements. RE is a socio-technical activity and requires intensive communication with several clients. RE activities might be considerably influenced by individuals' cultural backgrounds, because culture has a deep impact on the way people communicate. However, there has been limited exploration of this issue. We present a framework that identifies and analyses cultural influences on RE activities. To build the framework, we employed Hofstede's cultural model and a mixed-methods design comprising two case studies involving two cultures: Saudi Arabia and Australia. The evaluation highlighted that the framework identifies cultural influences with high accuracy in other cultures as well.
Issue triage is a manual and time-consuming process for both open and closed source software projects. Triagers first validate the issue reports and then find the appropriate developers or teams to solve them. In our industrial case, we automated the assignment part of the problem with a machine learning based approach. However, the automated system's average accuracy is 3% below the human triagers' performance. In our effort to improve our approach, we analyzed the incorrectly assigned issue reports and realized that many of them have attachments, mostly screenshots. Such issue reports generally have short descriptions compared to the ones without attachments, which we consider one of the reasons for incorrect classification. In this study, we describe our proposed approach to include this new piece of information for issue triage and present the initial results.
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.
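Open-vocabulary models avoid the proliferation of unseen identifiers by operating on subword units, so that a never-before-seen name decomposes into known pieces. The snippet below illustrates that idea with a deliberately crude rule-based segmentation (underscores and camelCase boundaries); the paper's model instead learns its subword units from data (e.g., with byte-pair encoding), and the identifiers here are invented:

```python
import re

def subword_split(identifier):
    # Crude segmentation on underscores and lower-to-upper case changes;
    # learned subword vocabularies (e.g., BPE) are what NLMs actually use.
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", identifier)
    return [p.lower() for p in parts if p]

# A tiny hypothetical training corpus of identifiers.
corpus = ["getUserName", "get_user_id", "setUserName", "getUserEmail"]
subword_vocab = {s for ident in corpus for s in subword_split(ident)}
```

The payoff: the identifier `setUserEmail` never occurs in the corpus, yet all of its subwords (`set`, `user`, `email`) are in the subword vocabulary, so an open-vocabulary model can still represent it instead of treating it as out-of-vocabulary.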
Based on Grounded Theory guidelines, we interviewed 27 IT professionals to investigate how organizations pursuing continuous delivery should organize their development and operations teams. In this paper, we present the discovered organizational structures: (1) siloed departments, (2) classical DevOps, (3) cross-functional teams, and (4) platform teams.
With the global increase in demand for online tertiary education, teachers are facing unique challenges in scaling assessment activities and meaningful student engagement. One such challenge is the contract cheating behaviour exhibited in the modern online environment --- posing a threat to the academic integrity of tertiary education. These obstacles are amplified in traditionally difficult domains like introductory programming education. Prior research on contract cheating identification proposes that, while challenging, techniques such as developing strong teacher-student relationships and real-time discussions may lead to instances of identifying contract cheating behaviours. The proposition, then, is to scale real-time, student-teacher discussions to large, online cohorts --- similar to the discussions that traditionally took place in the classroom. This poster paper presents Intelligent Discussion Comments (IDCs): a scalable, teacher-asynchronous system which engages students in real-time discussions to extract authentic student understanding. Artificial intelligence services such as voice identification and transcription enrich the discussion process, supporting the teaching team in their decision-making.
Program failures are often caused by invalid inputs, for instance due to input corruption. To obtain the passing input, one needs to debug the data. In this paper we present a generic technique called ddmax that (1) identifies which parts of the input data prevent processing, and (2) recovers as much of the (valuable) input data as possible. To the best of our knowledge, ddmax is the first approach that fixes faults in the input data without requiring program analysis. In our evaluation, ddmax repaired about 69% of input files and recovered about 78% of data within one minute per input.
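To illustrate the *goal* of ddmax --- recovering as much valid input data as possible from a corrupted file --- here is a deliberately naive brute-force stand-in: scan for the longest contiguous fragment that the program accepts. The real ddmax uses an efficient delta-debugging search and can also drop non-contiguous parts; the JSON predicate and inputs below are invented examples.

```python
import json

def longest_valid_fragment(data, passes):
    """Brute-force O(n^2) illustration of input maximization: return the
    longest contiguous substring of `data` accepted by `passes`."""
    best = ""
    for i in range(len(data)):
        # Only try substrings longer than the best found so far.
        for j in range(len(data), i + len(best), -1):
            if passes(data[i:j]):
                best = data[i:j]
                break
    return best

def parses_as_json(s):
    """The 'program' under repair: does the input parse as JSON?"""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False
```

Given a JSON document wrapped in corrupted bytes, the scan recovers the intact object while discarding the garbage --- the same recovery objective ddmax achieves far more efficiently and without requiring program analysis.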
There are often constraints associated with data used in software, describing the expected length, value, uniqueness, and other properties of the stored data. Correctly specifying and checking such constraints is crucial for the reliability, maintainability, and usability of software. This is particularly important for database-backed web applications, where a huge amount of data generated by millions of users plays a central role in user interaction and application logic. Furthermore, such data persists in the database and needs to continue serving users despite frequent software upgrades and data migration. As a result, consistently and comprehensively specifying data constraints, checking them, and handling constraint violations are of utmost importance.
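The kinds of constraints in question --- presence, length, uniqueness --- can be made concrete with a minimal declarative validator. This is a generic, hypothetical sketch (field names and rule keys are invented); real web applications typically split such checks between the database schema and the application's model layer, and inconsistency between the two is precisely what makes the problem hard.

```python
def check_constraints(rows, constraints):
    """rows: list of dicts; constraints: per-field rules such as
    required, max_len, and unique. Returns (row_index, field, error)
    tuples for every violation."""
    errors = []
    seen = {f: set() for f, c in constraints.items() if c.get("unique")}
    for idx, row in enumerate(rows):
        for field, c in constraints.items():
            value = row.get(field)
            if value is None:
                if c.get("required"):
                    errors.append((idx, field, "missing"))
                continue
            if "max_len" in c and len(value) > c["max_len"]:
                errors.append((idx, field, "too long"))
            if c.get("unique"):
                if value in seen[field]:
                    errors.append((idx, field, "duplicate"))
                seen[field].add(value)
    return errors
```

A single pass over the data reports every violation with its location, which is the "checking and handling" half of the problem the paragraph describes.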
We found that many of the reported erroneous cases in popular DNN image classifiers occur because the trained models confuse one class with another or show biases towards some classes over others. Most existing DNN testing techniques focus on per-image violations and so fail to detect class-level confusions or biases. We developed a testing technique to automatically detect class-based confusion and bias errors in DNN-driven image classification software. We evaluated our implementation, DeepInspect, on several popular image classifiers, achieving precision up to 100% (avg. 72.6%) for confusion errors, and up to 84.3% (avg. 66.8%) for bias errors.
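The shift from per-image to class-level checking can be illustrated with a simple confusion-matrix view: flag class pairs where an above-threshold fraction of one class's inputs is predicted as the other. This is a simplified stand-in --- DeepInspect's actual confusion and bias metrics differ, and the labels and threshold below are invented for illustration.

```python
from collections import Counter

def confused_pairs(true_labels, predicted, threshold=0.1):
    """Flag unordered class pairs (a, b) where at least `threshold` of
    class a's inputs are misclassified as b (or vice versa) -- a
    class-level signal that per-image checks would miss."""
    totals = Counter(true_labels)
    errors = Counter((t, p) for t, p in zip(true_labels, predicted)
                     if t != p)
    flagged = set()
    for (t, p), n in errors.items():
        if n / totals[t] >= threshold:
            flagged.add(tuple(sorted((t, p))))
    return flagged
```

In the test below, 3 of 10 "cat" images are predicted as "dog": individually each is just one wrong prediction, but collectively the pair crosses the threshold and is reported as a class-level confusion.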
Context: Programmers frequently look for the code of previously solved problems that they can adapt to their own problem. Despite existing example code on the web, on sites like Stack Overflow, cryptographic Application Programming Interfaces (APIs) are commonly misused. Little is known about what makes examples helpful for developers using crypto APIs. Analogical problem solving is a psychological theory that investigates how people use known solutions to solve new problems. There is evidence that the capacity to reason and solve novel problems, a.k.a. fluid intelligence (Gf), and structurally and procedurally similar solutions support problem solving. Aim: Our goal is to understand whether similarity and Gf also have an effect in the context of using cryptographic APIs with the help of code examples. Method: We conducted a controlled experiment with 76 student participants developing with or without procedurally similar examples, using one of two Java crypto libraries, and measured the participants' Gf as well as the effect on usability (effectiveness, efficiency, satisfaction) and security bugs. Results: We observed a strong effect of code examples with high procedural similarity on all dependent variables. Fluid intelligence (Gf) had no effect. It also made no difference which library the participants used. Conclusions: Example code must be highly similar to a concrete solution, rather than abstract and generic, to have a positive effect in a development task.
Mobile app users post their opinions about apps, report bugs, or request features on various platforms, the main one being app stores. Previous research suggests that Twitter should be used as an additional resource for receiving users' feedback, as app users tweet about different issues. Although classification and review-summarization methods have previously been developed for each platform separately, manual investigation of reviews or tweets is still required to identify the similar or different points discussed on the App Store or Twitter. In this paper, we propose a framework to automatically study the differences or similarities among app reviews from the Google Play Store and tweets by using the semantics of the words. The results of several experiments, compared with expert evaluation, confirm that it can be applied to identify the similarities or differences among the extracted topics, n-grams, and users' comments.
Symbolic execution is a powerful technique for systematically exploring program paths, but scaling symbolic execution to practical programs remains challenging. State-of-the-art techniques struggle to efficiently explore incremental behaviors, especially for highly coupled programs with complex control and data dependencies. In this paper, we present a novel approach for incremental symbolic execution based on an iterative loop between path exploration and path-suffix summarization. On one hand, the explored paths are summarized to enable more precise identification of affected paths; on the other hand, the summary guides path exploration to prune paths that have no incremental behaviors. We implemented a prototype of our approach and conducted experiments on a set of real-world applications. The results show that it is efficient and effective in exploring incremental behaviors.
OSS ecosystems promote code reuse and knowledge sharing across the projects within them. An ecosystem's developers often develop similar activity patterns, which might impact project outcomes in an ecosystem-specific way. Since elite developers play critical roles in most OSS projects, investigating their behaviors at the ecosystem level becomes urgent. Thus, we aim to investigate elite developers' activities and their relationships with project outcomes (productivity and quality). We design a large-scale empirical study which characterizes elite developers' activity profiles and identifies the relationships between their effort allocations and project outcomes across five ecosystems. Our current results and findings reveal that elite developers in each ecosystem do behave in ecosystem-specific ways. Further, we find that the elites' effort allocations to different activity categories are potentially correlated with project outcomes.
The need for mobile applications and mobile programming is increasing due to the continuous rise in the pervasiveness of mobile devices. Developers often refer to video programming tutorials to learn more about mobile programming topics. To find the right video to watch, developers typically skim over several videos, looking at their title, description, and video content in order to determine if they are relevant to their information needs. Unfortunately, the title and description do not always provide an accurate overview, and skimming over videos is time-consuming and can lead to missing important information. We propose a novel approach that locates and extracts the GUI screens showcased in a video tutorial, then selects and displays the most representative ones to provide a GUI-focused overview of the video. We believe this overview can be used by developers as an additional source of information for determining if a video contains the information they need. To evaluate our approach, we performed an empirical study on iOS and Android programming screencasts which investigates the accuracy of our automated GUI extraction. The results reveal that our approach can detect and extract GUI screens with an accuracy of 94%.
We present DLFix, a two-layer tree-based model that learns bug-fixing code changes and their surrounding code context to improve Automated Program Repair (APR). The first layer learns the surrounding code context of a fix and uses it as weights for the second layer, which learns the bug-fixing code transformation. Our empirical results on Defects4J show that DLFix can fix 30 bugs, and its results are comparable and complementary to those of the best-performing pattern-based APR tools. Furthermore, DLFix fixes 2.5 times more bugs than the best-performing deep learning baseline.
Developer forums are among the most popular and useful Q&A venues for questions about API usage. Analyzing API forums can be a critical step toward automated question-answering approaches. In this poster, we empirically study three API forums (Twitter, eBay, and AdWords) to investigate the characteristics of their question-answering processes. We observe that over 60% of the posts on all forums were answered with API method names or documentation, that over 85% of the questions were answered by API development teams, and that answers from API development teams drew fewer follow-up questions. Our results provide empirical evidence for future work on automated solutions that answer developer questions on API forums.
Divergent forks are a common practice in open-source software development for performing long-term, independent, and diverging development on top of a popular source repository. However, keeping such divergent downstream forks in sync with the upstream source evolution poses engineering challenges in the form of frequent merge conflicts. In this work, we conduct the first industrial case study of frequent merges from upstream and the resulting merge conflicts, in the context of Microsoft Edge development. The study consists of two parts. First, we describe the nature of merge conflicts that arise due to merges from upstream. Second, we investigate the feasibility of automatically fixing a class of merge conflicts related to build breaks, which consume a significant amount of developer time to root-cause and fix. Towards this end, we implemented a tool, MrgBldBrkFixer, evaluated it on three months of real Microsoft Edge Beta development data, and report encouraging results.
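Before any tool can attempt an automatic resolution, it must first recognize the conflict regions a merge produces. The sketch below is purely illustrative (it is not MrgBldBrkFixer): it parses the standard Git conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) that delimit the two sides of each conflict hunk.

```python
# Illustrative sketch: extracting (ours, theirs) hunks from a file
# containing standard Git conflict markers. Not MrgBldBrkFixer itself;
# just the parsing step any conflict-handling tool needs.

def extract_conflicts(text):
    """Return a list of (ours, theirs) string pairs, one per conflict."""
    conflicts = []
    ours, theirs, state = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            ours, theirs, state = [], [], "ours"
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            conflicts.append(("\n".join(ours), "\n".join(theirs)))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts

sample = """int x = 1;
<<<<<<< HEAD
int y = 2;
=======
int y = 3;
>>>>>>> upstream/main
"""
print(extract_conflicts(sample))  # [('int y = 2;', 'int y = 3;')]
```

In a downstream fork, the "theirs" side of each hunk typically comes from the upstream branch being merged in; a repair tool would then decide, per hunk, which side (or combination) keeps the build green.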
Deep Learning (DL) systems are key enablers for engineering intelligent applications. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. We introduce DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC establishes a layer-wise functional understanding of the importance of DL system components and uses this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems and across multiple DL datasets demonstrates the usefulness and effectiveness of DeepImportance.
The rise in awareness of sustainable software has led to a focus on energy efficiency and consideration of code smells during software development. This in turn requires software engineering teachers to cover topics such as code smells in their courses, raising students' awareness of the impact of code smells and bad design choices, not just on the software but also on the environment. Thus, we propose a desktop game named Refactor4Green to teach code smells and refactoring to novice programmers. The core idea of the game is to introduce code smells with refactoring choices through the theme of a green environment. We conducted a preliminary study with university students and received positive feedback from 83.06% of the participants.
To give students as authentic a learning experience as possible, many software-focused degrees incorporate team-based capstone projects in the final year of study. Designing capstone projects, however, is not a trivial undertaking: a number of constraints need to be considered, especially when defining learning outcomes, choosing clients and projects, guiding students, creating an effective project "support infrastructure", and measuring student outcomes. To address these challenges, we propose ACE, a novel, scalable model for managing capstone projects that adapts Spotify's Squads and Tribes organization to an educational setting. We present our motivation, the key components of the model, and its adoption, and report preliminary observations.
Ethical and social problems of the emerging technology of self-driving cars are best addressed through an applied engineering-ethics approach. Currently, however, these problems are typically framed as an idealized, unsolvable decision-making problem, the so-called Trolley Problem. Instead, we propose that ethical analysis should focus on the ethics of complex real-world engineering problems. Because software plays a crucial role in the control of self-driving cars, software engineering solutions should handle actual ethical and social considerations. We take a closer look at the regulative instruments, standards, and the design and implementation of components, systems, and services, and we present practical social and ethical challenges that must be met in the socio-technological ecosystem of self-driving cars. These challenges imply new expectations for software engineering in the automotive industry.