ESEM '21: Proceedings of the 15th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

Full Citation in the ACM Digital Library

SESSION: Keynote Papers

How Empirical Research Supports Tool Development: A Retrospective Analysis and new Horizons

Empirical research provides two-fold support to the development of approaches and tools aimed at supporting software engineers. On the one hand, empirical studies help to understand a phenomenon or context of interest. On the other hand, studies compare approaches and evaluate how software engineers could benefit from them. Over the past decades, there has been a tangible evolution in how empirical evaluation is conducted in software engineering. This is due to multiple reasons. First, the research community has matured considerably, thanks in part to guidelines developed by several researchers. Second, the large availability of data and artifacts, mainly from open-source projects, has made it possible to conduct larger evaluations, and in some cases to reach study participants. This keynote will first overview how empirical research has been used over the past decades to evaluate tools, and how this is changing over the years. Then, we will focus on the importance of combining quantitative and qualitative evaluations, and how sometimes "depth" turns out to be more useful than just "breadth". We will also emphasize how research is not a straightforward path, and negative results are often an essential component for future advances. Last, but not least, we will discuss how the role of empirical evaluation is changing with the pervasiveness of artificial intelligence methods in software engineering research.

Measurement Challenges for Cyber Cyber Digital Twins: Experiences from the Deployment of Facebook's WW Simulation System

A cyber cyber digital twin is a deployed software model that executes in tandem with the system it simulates, contributing to, and drawing from, the system's behaviour. This paper outlines Facebook's cyber cyber digital twin, dubbed WW, a twin of Facebook's WWW platform, built using web-enabled simulation. The paper focuses on the current research challenges and opportunities in the area of measurement. Measurement challenges lie at the heart of modern simulation. They directly impact how we use simulation outcomes for automated online and semi-automated offline decision making. Measurements also encompass how we verify and validate those outcomes. Modern simulation systems are increasingly becoming more like cyber cyber digital twins, effectively moving from manual to automated decision making; hence, these measurement challenges acquire ever greater significance.

SESSION: Technical Papers

A Model of Software Prototyping based on a Systematic Map

Background: Prototyping is an established practice for user interface design and for requirements engineering within agile software development; even so, there is a lack of theory on prototyping. Aims: The main research objective is to provide a means to categorise prototyping instances, in order to enable comparison of and reflection on prototyping practices. Method: We have performed a systematic mapping study of methodological aspects of prototyping consisting of thirty-three primary studies, upon which we designed a model of prototyping that was validated through a focus group at a case company. Results: Our model consists of four aspects of prototyping, namely purpose, prototype scope, prototype use, and exploration strategy. This model supported the focus group participants in discussing prototyping practices by considering concrete prototyping instances in terms of the concepts provided by our model. Conclusions: The model can be used to categorise prototyping instances and can support practitioners in reflecting on their prototyping practices. Our study provides a starting point for further research on prototyping and into how the practice can be applied more cost-effectively to elicit, validate, and communicate requirements.

A Survey-Based Qualitative Study to Characterize Expectations of Software Developers from Five Stakeholders

Background. Studies on developer productivity and well-being find that the perceptions of productivity in a software team can be a socio-technical problem. Intuitively, problems and challenges can be better handled by managing expectations in software teams. Aim. Our goal is to understand whether the expectations of software developers vary towards diverse stakeholders in software teams. Method. We surveyed 181 professional software developers to understand their expectations from five different stakeholders: (1) organizations, (2) managers, (3) peers, (4) new hires, and (5) government and educational institutions. The five stakeholders were determined by conducting semi-formal interviews with software developers. We asked open-ended survey questions and analyzed the responses using open coding. Results. We observed 18 multi-faceted expectation types. While some expectations are more specific to a stakeholder, other expectations are cross-cutting. For example, developers expect work benefits from their organizations, but expect the adoption of standard software engineering (SE) practices from their organizations, peers, and new hires. Conclusion. Out of the 18 categories, three categories are related to career growth. This observation supports previous research that happiness cannot be assured by simply offering more money or a promotion. Among the categories with the most responses, we find expectations from educational institutions to offer relevant teaching and from governments to improve job stability, which indicates the increasingly important roles of these organizations in helping software developers. This observation can be especially true during the COVID-19 pandemic.

A comparative study of vulnerability reporting by software composition analysis tools

Background: Modern software uses many third-party libraries and frameworks as dependencies. Known vulnerabilities in these dependencies are a potential security risk. Software composition analysis (SCA) tools, therefore, are being increasingly adopted by practitioners to keep track of vulnerable dependencies. Aim: The goal of this study is to understand the difference in vulnerability reporting by various SCA tools. Understanding if and how existing SCA tools differ in their analysis may help security practitioners to choose the right tooling and identify future research needs. Method: We present an in-depth case study by comparing the analysis reports of 9 industry-leading SCA tools on a large web application, OpenMRS, composed of Maven (Java) and npm (JavaScript) projects. Results: We find that the tools vary in their vulnerability reporting. The count of reported vulnerable dependencies ranges from 17 to 332 for Maven and from 32 to 239 for npm projects across the studied tools. Similarly, the count of unique known vulnerabilities reported by the tools ranges from 36 to 313 for Maven and from 45 to 234 for npm projects. Our manual analysis of the tools' results suggests that accuracy of the vulnerability database is a key differentiator for SCA tools. Conclusion: We recommend that practitioners should not rely on any single tool at present, as that can result in missing known vulnerabilities. We point out two research directions in the SCA space: i) establishing frameworks and metrics to identify false positives for dependency vulnerabilities; and ii) building automation technologies for continuous monitoring of vulnerability data from open source package ecosystems.
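
The recommendation not to rely on a single tool can be illustrated with a minimal sketch that compares the sets of vulnerability identifiers reported by several tools. The tool names and identifiers below are invented for illustration; they are not data from the study, and the study's own comparison procedure may differ.

```python
def compare_reports(reports):
    """Summarize how much of the combined findings each tool covers.

    `reports` maps a (hypothetical) tool name to the set of
    vulnerability identifiers that tool reported.
    """
    union = set().union(*reports.values())
    summary = {}
    for tool, found in reports.items():
        summary[tool] = {
            "reported": len(found),
            "missed": len(union - found),  # known to some other tool only
            "coverage": len(found) / len(union) if union else 1.0,
        }
    return union, summary


def jaccard(a, b):
    """Jaccard similarity of two report sets (1.0 means identical reports)."""
    combined = a | b
    return len(a & b) / len(combined) if combined else 1.0


# Hypothetical reports, for illustration only.
reports = {
    "tool_a": {"VULN-1", "VULN-2", "VULN-3"},
    "tool_b": {"VULN-2", "VULN-4"},
    "tool_c": {"VULN-1", "VULN-2", "VULN-4", "VULN-5"},
}

union, summary = compare_reports(reports)
for tool, stats in summary.items():
    print(tool, stats)
```

In this toy data no tool reports all five identifiers, which is the situation the conclusion warns about. Note that the union may also contain false positives, so low coverage alone does not show that a tool is worse.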

An Empirical Analysis of Practitioners' Perspectives on Security Tool Integration into DevOps

Background: Security tools play a vital role in enabling developers to build secure software. However, it can be quite challenging to introduce and fully leverage security tools without affecting the speed or frequency of deployments in the DevOps paradigm. Aims: We aim to empirically investigate the key challenges practitioners face when integrating security tools into a DevOps workflow in order to provide recommendations for overcoming the challenges. Method: We conducted a study involving 31 systematically selected webinars on integrating security tools in DevOps. We used a qualitative data analysis method, i.e., thematic analysis, to identify the challenges and emerging solutions related to integrating security tools in rapid deployment environments. Results: We find that whilst traditional security tools are unable to cater for the needs of DevOps, the industry is moving towards new generations of security tools that have started focusing on the needs of DevOps. We have developed a DevOps workflow that integrates security tools and a set of guidelines by synthesizing practitioners' recommendations in the analyzed webinars. Conclusion: Whilst the latest security tools are addressing some of the requirements of DevOps, there are many tool-related drawbacks yet to be adequately addressed.

An Empirical Examination of the Impact of Bias on Just-in-time Defect Prediction

Background: Just-In-Time (JIT) defect prediction models predict if a commit will introduce defects in the future. DeepJIT and CC2Vec are two state-of-the-art JIT Deep Learning (DL) techniques. Usually, defect prediction techniques are evaluated by treating all training data equally. However, data is usually imbalanced not only in terms of the overall class label (e.g., defect and non-defect) but also in terms of characteristics such as File Count, Edit Count, Multiline Comments, Inward Dependency Sum, etc. Prior research has investigated the impact of class imbalance on prediction techniques' performance but not the impact of imbalance in other characteristics. Aims: We aim to explore the impact of imbalance in different commit-related characteristics on DL defect prediction. Method: We investigated the impact of different characteristics on the overall performance of DeepJIT and CC2Vec. We also propose a Siamese network based few-shot learning framework for JIT defect prediction (SifterJIT) combining a Siamese network and DeepJIT. Results: Our results show that DeepJIT and CC2Vec lose around 20% in performance when trained and tested on imbalanced data. However, SifterJIT can outperform state-of-the-art DL techniques with an average of 8.65% AUC score, 11% precision, and 6% F1-score improvement. Conclusions: Our results highlight that dataset imbalance in terms of commit characteristics can significantly impact prediction performance, and few-shot learning based techniques can help alleviate the situation.

An Empirical Study of Rule-Based and Learning-Based Approaches for Static Application Security Testing

Background: Static Application Security Testing (SAST) tools purport to assist developers in detecting security issues in source code. These tools typically use rule-based approaches to scan source code for security vulnerabilities. However, due to the significant shortcomings of these tools (i.e., high false positive rates), learning-based approaches for Software Vulnerability Prediction (SVP) are becoming a popular approach. Aims: Despite the similar objectives of these two approaches, their comparative value is unexplored. We provide an empirical analysis of SAST tools and SVP models, to identify their relative capabilities for source code security analysis. Method: We evaluate the detection and assessment performance of several common SAST tools and SVP models on a variety of vulnerability datasets. We further assess the viability and potential benefits of combining the two approaches. Results: SAST tools and SVP models provide similar detection capabilities, but SVP models exhibit better overall performance for both detection and assessment. Unification of the two approaches is difficult due to lacking synergies. Conclusions: Our study generates 12 main findings which provide insights into the capabilities and synergy of these two approaches. Through these observations we provide recommendations for use and improvement.

An Empirical Study on Refactoring-Inducing Pull Requests

Background: Pull-based development has shaped the practice of Modern Code Review (MCR), in which reviewers can contribute code improvements, such as refactorings, through comments and commits in Pull Requests (PRs). Past MCR studies uniformly treat all PRs, regardless of whether they induce refactoring or not. We define a PR as refactoring-inducing when refactoring edits are performed after the initial commit(s), as either a result of discussion among reviewers or spontaneous actions carried out by the PR developer. Aims: This mixed study (quantitative and qualitative) explores code reviewing-related aspects intending to characterize refactoring-inducing PRs. Method: We hypothesize that refactoring-inducing PRs have different characteristics from non-refactoring-inducing ones and thus deserve special attention and treatment from researchers, practitioners, and tool builders. To investigate our hypothesis, we mined a sample of 1,845 merged Apache PRs from GitHub, mined refactoring edits in these PRs, and ran a comparative study between refactoring-inducing and non-refactoring-inducing PRs. We also manually examined 2,096 review comments and 1,891 detected refactorings from 228 refactoring-inducing PRs. Results: We found that 30.2% of the PRs in our sample are refactoring-inducing and that they significantly differ from non-refactoring-inducing ones in terms of number of commits, code churn, number of file changes, number of review comments, length of discussion, and time to merge. However, we found no statistical evidence that the number of reviewers is related to refactoring-inducement. Our qualitative analysis revealed that at least one refactoring edit was induced by review in 133 (58.3%) of the refactoring-inducing PRs examined. Conclusions: Our findings suggest directions for researchers, practitioners, and tool builders to improve practices around pull-based code review.

An Exploratory Study on Dead Methods in Open-source Java Desktop Applications

Background. Dead code is a code smell. It can refer to code blocks, fields, methods, etc. that are unused and/or unreachable. Empirical evidence shows that dead code harms source code comprehensibility and maintainability in software applications. Researchers have gathered little empirical evidence on the spread of dead code in software applications. Moreover, we know little about the role of this code smell during software evolution.

Aims. Our goal is to gather preliminary empirical evidence on the spread and evolution of dead methods in open-source Java desktop applications. Given the exploratory nature of our investigation, we believe that its results can justify more resource- and time-demanding research on dead methods.

Method. We quantitatively analyzed the commit histories of 13 open-source Java desktop applications, whose software projects were hosted on GitHub, for a total of 1,044 commits. We focused on dead methods detected at a commit level to investigate the spread and evolution of dead methods in the studied applications. The perspective of our explorative study is that of both practitioners and researchers.

Results. The most important take-away results can be summarized as follows: (i) dead methods seem to affect open-source Java desktop applications; (ii) dead methods generally survive for a long time, in terms of commits, before being "buried" or "revived;" (iii) dead methods are rarely revived; and (iv) most dead methods are dead since the creation of the corresponding methods.

Conclusions. We conclude that developers should carefully handle dead methods in open-source Java desktop applications since this code smell is harmful, widespread, rarely revived, and survives for a long time in software applications. Our results also justify future research on dead methods.

Barriers to Shift-Left Security: The Unique Pain Points of Writing Automated Tests Involving Security Controls

Background: Automated unit and integration tests allow software development teams to continuously evaluate their application's behavior and ensure requirements are satisfied. Interest in explicitly testing security at the unit and integration levels has risen as more teams begin to shift security left in their workflows, but there is little insight into any potential pain points developers may experience as they learn to adapt their existing skills to write these tests. Aims: Identify security unit and integration testing pain points that could negatively impact efforts to shift security (testing) left to this level. Method: A mixed-method empirical study was conducted on 525 Stack Overflow and Security Stack Exchange posts related to security unit and integration testing. Latent Dirichlet Allocation (LDA) was applied to identify commonly discussed topics, pain points were learned through qualitative analysis, and links were analyzed to study commonly shared resources. Results: Nine topics representing security controls, components, and scenarios were identified; Authentication was the most commonly tested control. Developers experienced seven pain points unique to security unit and integration testing, which were all influenced by the complexity of security control designs and implementations. Most linked resources were other Q&A posts, but repositories and documentation for security tools and libraries were also common. Conclusions: Developers may experience several unique pain points when writing tests at this level involving security controls. Additional resources are needed to guide developers through these challenges, which should also influence the creation of strategies and tools to help shift security testing to this level. To accelerate this, actionable recommendations for practitioners and future research directions based on these findings are highlighted.

Characteristics and Challenges of Low-Code Development: The Practitioners' Perspective

Background: In recent years, low-code development (LCD) has been growing rapidly, and Gartner and Forrester have predicted that the use of LCD is very promising. Giant companies, such as Microsoft, Mendix, and OutSystems, have also launched their LCD platforms. Aim: In this work, we explored two popular online developer communities, Stack Overflow (SO) and Reddit, to provide insights on the characteristics and challenges of LCD from a practitioners' perspective. Method: We used two LCD-related terms to search the relevant posts in SO and extracted 73 posts. Meanwhile, we explored three LCD-related subreddits from Reddit and collected 228 posts. We extracted data from these posts and applied the Constant Comparison method to analyze the descriptions, benefits, and limitations and challenges of LCD. For platforms and programming languages used in LCD, implementation units in LCD, supporting technologies of LCD, types of applications developed by LCD, and domains that use LCD, we used descriptive statistics to analyze and present the results. Results: Our findings show that: (1) LCD may provide a graphical user interface for users to drag and drop with little or even no code; (2) the provision of out-of-the-box units (e.g., APIs and components) in LCD platforms makes them easy to learn and use and speeds up development; (3) LCD is particularly favored in the domains that have the need for automated processes and workflows; and (4) practitioners have conflicting views on the advantages and disadvantages of LCD. Conclusions: Our findings suggest that researchers should clearly define the terms when they refer to LCD, and developers should consider whether the characteristics of LCD are appropriate for their projects.

Characterizing and Predicting Good First Issues

Background. Where to start contributing to a project is a critical challenge for newcomers to open source projects. To support newcomers, GitHub utilizes the Good First Issue (GFI) label, with which project members can manually tag issues in an open source project that are suitable for newcomers. However, manually labeling GFIs is time- and effort-consuming given the large number of candidate issues. In addition, project members need to have a close understanding of the project to label GFIs accurately.

Aims. This paper aims to provide a thorough understanding of the characteristics of GFIs and an automatic approach for GFI prediction, to reduce the burden on project members and help newcomers onboard easily.

Method. We first define 79 features to characterize the GFIs and further analyze the correlation between each feature and GFIs. We then build machine learning models to predict GFIs with the proposed features.

Results. Experiments are conducted with 74,780 issues from 10 open source projects from GitHub. Results show that features related to the semantics, readability, and text richness of issues can be used to effectively characterize GFIs. Our prediction model achieves a median AUC of 0.88. Results from our user study further prove its potential practical value.

Conclusions. This paper provides new insights and practical guidelines to facilitate the understanding of GFIs and the automation of GFI labeling.
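
The AUC figure in the results above can be read as the probability that a randomly chosen GFI receives a higher predicted score than a randomly chosen non-GFI. The following minimal sketch computes AUC from that pairwise definition. The labels and scores are invented for illustration and are not taken from the paper's experiments, and the function name is hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    item has the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical issues: label 1 = good first issue, 0 = other issue.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.7, 0.8]
print(round(auc(labels, scores), 3))
```

The quadratic pairwise loop is chosen for readability; for large issue sets a rank-based computation gives the same value more efficiently.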

Continuous Software Bug Prediction

Background: Many software bug prediction models have been proposed and evaluated on a set of well-known benchmark datasets. We conducted pilot studies on the widely used benchmark datasets and observed common issues among them. Specifically, most existing benchmark datasets consist of randomly selected historical versions of software projects, which poses non-trivial threats to the validity of existing bug prediction studies since real-world software projects often evolve continuously. Yet how to conduct software bug prediction in real-world continuous software development scenarios is not well studied.

Aims: In this paper, to bridge the gap between current software bug prediction practice and real-world continuous software development, we propose new approaches to conduct bug prediction in real-world continuous software development regarding model building, updating, and evaluation.

Method: For model building, we propose ConBuild, which leverages distributional characteristics of bug prediction data to guide the training version selection. For model updating, we propose ConUpdate, which leverages the evolution of distributional characteristics of bug prediction data between versions to guide the reuse or update of bug prediction models in continuous software development. For model evaluation, we propose ConEA, which leverages the evolution of buggy probability of files between versions to conduct effort-aware evaluation.

Results: Experiments on 120 continuously released versions that span six large-scale open-source software systems show the practical value of our approaches.

Conclusions: This paper provides new insights and guidelines for conducting software bug prediction in the context of continuous software development.

Evaluating the Impact of Java Virtual Machines on Energy Consumption

Background. Java Virtual Machine (JVM) platforms have undergone multiple evolutions over the last decades to enhance both the performance they exhibit and the features they offer. With regard to energy consumption, few studies have investigated the energy consumption of code and data structures. Yet, we still lack an evaluation of the energy efficiency of existing JVM platforms and an identification of the configurations that minimize the energy consumption of software hosted on the JVM.

Aims. The purpose of this paper is to investigate the variations in energy consumption between different JVM distributions and parameters to help developers configure the least consuming environment for their Java application.

Method. We thus assess the energy consumption of some of the most popular and supported JVM platforms using 12 Java benchmarks that explore different performance objectives. Moreover, we investigate the impact of the different JVM parameters and configurations on the energy consumption of software.

Results. Our results show that some JVM platforms can exhibit up to 100% more energy consumption. JVM configurations can also play a substantial role in reducing energy consumption during software execution. Interestingly, the default configuration of the garbage collector was energy efficient in only 50% of our experiments.

Conclusion. Finally, we provide an open-source tool, named J-Referral, that recommends an energy-efficient JVM distribution and configuration for any Java application.

Facing the Giant: a Grounded Theory Study of Decision-Making in Microservices Migrations

Background: Microservices migrations are challenging and expensive projects with many decisions that need to be made in a multitude of dimensions. Existing research tends to focus on technical issues and decisions (e.g., how to split services). Equally important organizational or business issues and their relations with technical aspects often remain out of scope or on a high level of abstraction.

Aims: In this study, we aim to holistically chart the decision-making that happens on all dimensions of a migration project towards microservices (including, but not limited to, the technical dimension).

Method: We investigate 16 different migration cases in a grounded theory interview study, with 19 participants that recently migrated towards microservices. This study strongly focuses on the human aspects of a migration, through stakeholders and their decisions.

Results: We identify 3 decision-making processes consisting of 22 decision-points and their alternative options. The decision-points are related to creating stakeholder engagement and assessing feasibility, technical implementation, and organizational restructuring.

Conclusions: Our study provides an initial theory of decision-making in migrations to microservices. It also outfits practitioners with a roadmap of which decisions they should be prepared to make and at which point in the migration.

Promises and Perils of Inferring Personality on GitHub

Background: Personality plays a pivotal role in our understanding of human actions and behavior. Today, the applications of personality are widespread, built on the solutions from psychology to infer personality. Aim: In software engineering, for instance, one widely used solution to infer personality uses textual communication data. As studies on personality in software engineering continue to grow, it is imperative to understand the performance of these solutions. Method: This paper compares the inferential ability of three widely studied text-based personality tests against each other and the ground truth on GitHub. We explore the challenges and potential solutions to improve the inferential ability of personality tests. Results: Our study shows that solutions for inferring personality are far from being perfect. Software engineering communications data can infer individual developer personality with an average error rate of 41%. In the best case, the error rate can be reduced up to 36% by following our recommendations.

Public Software Development Activity During the Pandemic

Background: The emergence of the COVID-19 pandemic has impacted all human activity, including software development. Early reports seem to indicate that the pandemic may have had a negative effect on software developers, socially and personally, but that their software development productivity may not have been negatively impacted. Aims: Early reports about the effects of the pandemic on software development focused on software developers' well-being and on their productivity as employees. We are interested in a different aspect of software development: the developers' public contributions, as seen in GitHub and Stack Overflow activities. Did the pandemic affect the developers' public contributions and, if so, in what way? Method: Considering data from 2017 to 2020, we study the trends within GitHub's push, create, pull request, and release events, and within Stack Overflow's new users, posts, votes, and comments. We performed linear regressions, correlation analyses, outlier analyses, and hypothesis testing, and we also contacted individual developers in order to gather qualitative insights about their unusual public contributions. Results: Our study shows that within GitHub and Stack Overflow, the onset of the pandemic (March/April 2020) is reflected in a set of outliers in developers' contributions that point to an increase in activity. The distributions of contributions during the entire year of 2020 were, in some aspects, different from, but, in other aspects, similar to the recent past. Additionally, we found one noticeably disrupted pattern of contribution in Stack Overflow, namely the ratio Questions/Answers, which was much higher in 2020 than before. Testimonials from the developers we contacted were mixed: while some developers reported that their increase in activity was due to the pandemic, others reported that it was not. Conclusion: In GitHub, there was a noticeable increase in public software development activity in 2020, as well as more abrupt changes in daily activities; in Stack Overflow, there was a noticeable increase in new users and new questions at the onset of the pandemic, and in the ratio of Questions/Answers during 2020. The results may be attributed to the pandemic, but other factors could have come into play.
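
One way to combine the linear regression and outlier analysis mentioned in the method is to fit a linear trend to a series of activity counts and flag the points whose residual is unusually large. The sketch below is a generic illustration under that assumption, not the authors' procedure. The counts, the function names, and the threshold of two standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def fit_line(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def trend_outliers(ys, threshold=2.0):
    """Indices whose residual from the fitted trend exceeds
    `threshold` sample standard deviations of the residuals."""
    slope, intercept = fit_line(ys)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(ys)]
    spread = stdev(residuals)
    if spread == 0:
        return []  # perfectly linear series, nothing to flag
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Hypothetical monthly event counts with one spike at index 7.
monthly_counts = [100, 102, 104, 106, 108, 110, 112, 160, 116, 118, 120, 122]
print(trend_outliers(monthly_counts))  # -> [7]
```

A large outlier also pulls the fitted line towards itself, so a robust fit or a fit on pre-2020 data only may be preferable in a real analysis.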

Security Smells Pervade Mobile App Servers

[Background] Web communication is universal in cyberspace, and security risks in this domain are devastating. [Aims] We analyzed the prevalence of six security smells in mobile app servers, and we investigated the consequence of these smells from a security perspective. [Method] We used an existing dataset that includes 9,714 distinct URLs used in 3,376 Android mobile apps. We exercised these URLs twice within 14 months and investigated the HTTP headers and bodies. [Results] We found that more than 69% of tested apps suffer from three kinds of security smells, and that unprotected communication and misconfigurations are very common in servers. Moreover, source-code and version leaks, or the lack of update policies, expose app servers to security risks. [Conclusions] Poor app server maintenance greatly hampers security.
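
Inspecting a URL and its HTTP response headers for signs of such smells can be sketched as follows. The abstract does not enumerate the six smells, so the three checks below (no HTTPS, HTTPS without an HSTS header, and version numbers in the Server or X-Powered-By headers) are assumptions chosen for illustration. The function name and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit

# A dotted number such as "2.4" or "7.4.3" suggests a leaked version.
VERSION_PATTERN = re.compile(r"\d+\.\d+")


def find_smells(url, headers):
    """Return a list of smell descriptions for one URL and its response headers.

    The checks are illustrative assumptions, not the paper's smell catalog.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    smells = []
    if urlsplit(url).scheme != "https":
        smells.append("unprotected communication (no HTTPS)")
    elif "strict-transport-security" not in lowered:
        smells.append("HTTPS without HSTS header")
    for name in ("server", "x-powered-by"):
        if VERSION_PATTERN.search(lowered.get(name, "")):
            smells.append(f"version leak in {name} header")
    return smells


# Hypothetical responses, for illustration only.
print(find_smells("http://api.example.com/v1", {"Server": "Apache/2.4.1"}))
print(find_smells(
    "https://api.example.com/v1",
    {"Strict-Transport-Security": "max-age=31536000", "Server": "nginx"},
))
```

The function takes already-collected headers rather than issuing requests, which keeps the checks easy to test and separates them from network access.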

Tackling Consistency-related Design Challenges of Distributed Data-Intensive Systems: An Action Research Study

Background: Distributed data-intensive systems are increasingly designed to be only eventually consistent. Persistent data is no longer processed with serialized and transactional access, exposing applications to a range of potential concurrency anomalies that need to be handled by the application itself. Controlling concurrent data access in monolithic systems is already challenging, but the problem is exacerbated in distributed systems. To make matters worse, the software architecture community provides little systematic engineering guidance on this issue. Aims: In this paper, we report on our study of the effectiveness and applicability of the novel design guidelines we are proposing in this regard. Method: We used action research and conducted it in the context of the software architecture design process of a multi-site platform development project. Results: Our hypotheses regarding effectiveness and applicability have been accepted in the context of the study. The initial design guidelines were refined throughout the study. Thus, we also contribute concrete guidelines for architecting distributed data-intensive systems with eventually consistent data. The guidelines are an advancement of Domain-Driven Design and provide additional patterns for the tactical design part. Conclusions: Based on our results, we recommend using the guidelines to architect safe eventually consistent systems. Because of the relevance of distributed data-intensive systems, we will drive this research forward and evaluate it in further domains.

Testing Smart Contracts: Which Technique Performs Best?

Background: Executing, verifying and enforcing credible transactions on permissionless blockchains is done using smart contracts. A key challenge with smart contracts is ensuring their correctness and security. Several test input generation techniques for detecting vulnerabilities in smart contracts have been proposed in the last few years. However, a comparison of proposed techniques to gauge their effectiveness is missing. Aim: This paper conducts an empirical evaluation of testing techniques for smart contracts. The testing techniques we evaluated are: (1) Blackbox fuzzing, (2) Adaptive fuzzing, (3) Coverage-guided fuzzing with an SMT solver and (4) Genetic algorithm. We do not consider static analysis tools, as several recent studies have assessed and compared effectiveness of these tools. Method: We evaluate effectiveness of the test generation techniques using (1) Coverage achieved - we use four code coverage metrics targeting smart contracts, (2) Fault finding ability - using artificially seeded and real security vulnerabilities of different types. We used two datasets in our evaluation - one with 1665 real smart contracts from Etherscan, and another with 90 real contracts with known vulnerabilities to assess fault finding ability. Result: We find Adaptive fuzzing performs best in terms of coverage and fault finding over contracts in both datasets. Conclusion: However, we believe considering dependencies between functions and handling Solidity specific features will help improve the performance of all techniques considerably.

The Existence and Co-Modifications of Code Clones within or across Microservices

In recent years, microservice architecture has been widely applied in software design. In addition, more and more monolithic software systems have been migrated to a microservice architecture. The core idea is to decompose the concerns of software projects into small and loosely-coupled services. Each service is supposed to be developed and even managed independently, which in turn improves the efficiency of development and maintenance. Code clones are common in software implementations, and many prior studies have revealed that code clones can cause maintenance difficulties. However, there is little work exploring the impact of code clones on microservice projects. To bridge this gap, we focus on exploring the existence and co-modifications of within-service and cross-service code clones. With our evaluation of eight microservice projects, we show that code clones still exist within services and across services. In addition, both within-service and cross-service code clones have been involved in co-modifications, meaning that these clones have caused maintenance difficulties. Finally, we have explored the characteristics of co-modifications in terms of changed LOC for both within-service and cross-service code clones.

Towards a Human Values Dashboard for Software Development: An Exploratory Study

Background: There is a growing awareness of the importance of human values (e.g., inclusiveness, privacy) in software systems. However, there are no practical tools to support the integration of human values during software development. We argue that a tool that can identify human values from software development artefacts and present them to varying software development roles can (partially) address this gap. We refer to such a tool as a human values dashboard. Further to this, our understanding of such a tool is limited. Aims: This study aims to (1) investigate the possibility of using a human values dashboard to help address human values during software development, (2) identify possible benefits of using a human values dashboard, and (3) elicit practitioners' needs from a human values dashboard. Method: We conducted an exploratory study by interviewing 15 software practitioners. A dashboard prototype was developed to support the interview process. We applied thematic analysis to analyse the collected data. Results: Our study finds that a human values dashboard would be useful for the development team (e.g., project manager, developer, tester). Our participants acknowledge that development artefacts, especially requirements documents and issue discussions, are the most suitable source for identifying values for the dashboard. Our study also yields a set of high-level user requirements for a human values dashboard (e.g., it shall allow determining the values priority of a project). Conclusions: Our study suggests that a values dashboard can potentially be used to raise awareness of values and support values-based decision-making in software development. Future work will focus on addressing the requirements and using issue discussions as potential artefacts for the dashboard.

What Evidence We Would Miss If We Do Not Use Grey Literature?

Context: Multivocal Literature Reviews (MLR) search for evidence in both Traditional Literature (TL) and Grey Literature (GL). Despite the growing interest in MLR-based studies, the literature assessing how GL has contributed to MLR studies is still scarce. Objective: This research aims to assess how the use of GL contributed to MLR studies. By contributing, we mean the extent to which GL provides evidence that is indeed used by an MLR to answer its research question. Method: We start by conducting a tertiary study to identify MLR studies published between 2017 and 2019, selecting nine of them. We then identified the GL used in these studies and assessed to what extent the GL provides evidence that helps these studies to answer their research questions. Results: Our analysis identified that 1) GL provided evidence not found in TL, 2) most of the GL sources were used to provide recommendations to solve problems, explain a topic, and classify the findings, and 3) 19 different GL types were used in the studies; these GLs were mainly produced by SE practitioners (including blog posts, slide presentations, or project descriptions). Conclusions: We provide evidence of how GL contributed to MLR studies. We observed that if these GLs were not included in the MLR, several findings would have been omitted or weakened. We also described the challenges involved when conducting this investigation, along with potential ways to deal with them, which may help future SE researchers.

Who are Vulnerability Reporters?: A Large-scale Empirical Study on FLOSS

(Background) Software vulnerabilities pose a serious threat to the security of computer systems. Hence, there is a constant race for defenders to find and patch them before attackers are able to exploit them. Measuring different aspects of this process is important in order to better understand it and improve the odds for defenders. (Aims) The human factor of the vulnerability discovery and patching process has received limited attention. Better knowledge of the characteristics of the people and organizations who discover and report security vulnerabilities can considerably enhance our understanding of the process, provide insights regarding the expended effort in vulnerability hunting, contribute to better security metrics, and help guide practical decisions regarding the strategy of projects to attract vulnerability researchers.

(Method) In this paper, we present what is, to the best of our knowledge, the first large-scale empirical study on the people and organizations who report vulnerabilities in popular FLOSS projects. Collecting data from a multitude of publicly available sources (NVD, bug-tracking platforms, vendor advisories, source code repositories), we create a dataset of reporter information for 2193 unique reporting entities of 4756 CVEs affecting the Mozilla suite, Apache httpd, the PHP interpreter, and the Linux kernel. We use the dataset to investigate several aspects of the vulnerability discovery process, specifically regarding the distribution of contributions, their temporal characteristics, and the motivations of reporters.

(Results) Among our results: around 80% of reports come from 20% of reporters; first-time reporters are significant contributors to the yearly total in all 4 projects; productive reporters are specialized w.r.t. the project and vulnerability types; around half of all reports come from reporters acknowledging an affiliation.

(Conclusions) Projects depend both on a core of dedicated and productive reporters, and on small contributions from a large number of community reporters. The generalized Pareto principle (the (1 - p)/p law) can be used as a metric for the concentration of contributions in the vulnerability-reporting ecosystem of a project.
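
One way to read the (1 - p)/p law as a concentration metric is sketched below in Python. The abstract does not give the exact definition used in the paper, so this sketch assumes the following reading: find the smallest fraction p of reporters whose combined reports account for at least a (1 - p) share of all reports. The function name `pareto_split` and the sample counts are hypothetical.

```python
def pareto_split(report_counts):
    """Return (p, share): the smallest fraction p of reporters whose
    combined reports make up at least a (1 - p) share of the total.

    Assumed reading of the (1 - p)/p law; not taken from the paper.
    """
    counts = sorted(report_counts, reverse=True)  # most productive first
    n, total = len(counts), sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one reporter with at least one report")
    cumulative = 0
    for k, count in enumerate(counts, start=1):
        cumulative += count
        # Integer form of: cumulative / total >= 1 - k / n
        if cumulative * n >= (n - k) * total:
            return k / n, cumulative / total


# Hypothetical data: reports per reporter for ten reporters.
p, share = pareto_split([40, 40, 5, 5, 2, 2, 2, 2, 1, 1])
print(f"top {p:.0%} of reporters account for {share:.0%} of reports")
```

With these made-up counts the split is 80/20, matching the classic Pareto ratio; perfectly uniform contributions would give 50/50, so a smaller p indicates a stronger concentration of reports in few reporters.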

Why Do Organizations Adopt Agile Scaling Frameworks?: A Survey of Practitioners

Background: The benefits of agile methods in small, co-located projects have inspired their adoption in large firms and projects. Scaling frameworks, such as Large-Scale Scrum (LeSS) and the Scaled Agile Framework (SAFe), have been proposed by practitioners to scale agile to larger contexts, and have become rather widely adopted in the industry. Despite the popularity of the frameworks, the knowledge on the reasons, expected benefits, and satisfaction of organizations adopting them is still limited.

Aims: This paper presents a study of practitioners who have adopted an agile scaling framework in their organization and investigates the reasons for, expected benefits of, and the satisfaction level with the adoption of the selected framework.

Method: We conducted a survey of software practitioners. We received data from 204 respondents representing ten frameworks adopted in 26 countries across six continents.

Results: The results show that SAFe is the most widely adopted framework among our respondents. The two most commonly mentioned reasons for adopting agile scaling frameworks are to scale to more people and to remain competitive in the market. The most common expected benefits are improving the collaboration and dependency management between teams. We also found some unique reasons and expected benefits for the framework adoption, such as inculcating an agile mindset, addressing the needs of regulated environments, dissolving silos, and technical excellence. Our findings indicated statistically significant differences for reasons, expected benefits, and satisfaction between different frameworks. Most of our respondents report that the selected framework met their expectations.

Conclusions: This paper offers the first quantitative assessment of reasons, expected benefits, and satisfaction of firms for adopting agile scaling frameworks. Future studies comparing scaling frameworks could help firms in selecting the most suitable framework fitting their needs.

SESSION: Emerging Results and Vision Papers

A Rubric to Identify Misogynistic and Sexist Texts from Software Developer Communications

Background: As contemporary software development organizations are dominated by males, occurrences of misogynistic and sexist remarks are abundant in many communities. Such remarks are barriers to promoting diversity and inclusion in the software engineering (SE) domain.

Aims: This study aims to develop a rubric to identify misogynistic remarks and sexist jokes specifically from software developer communications.

Method: We have followed the systematic literature review protocol to identify 10 primary studies that have characterized misogynistic and sexist texts in various domains.

Results: Based on our syntheses of the primary studies, we have developed a rubric to manually identify various categories of misogynistic or sexist remarks. We have also provided SE domain specific examples of those categories.

Conclusions: Our annotation guideline will pave the path towards building an automated misogynistic text classifier for the SE domain.

Contextual Understanding and Improvement of Metamorphic Testing in Scientific Software Development

Background: Metamorphic testing emerges as a simple and effective approach for testing scientific software; yet, its adoption in actual scientific software projects is less studied.

Aims: In order for practitioners to better adopt metamorphic testing in their projects, we set out to first gain a deep understanding of the current quality assurance workflow, testing practices, and tools.

Method: We propose to integrate various empirical sources, including artifact analysis, stakeholder interviews, and gap analysis from the literature.

Results: Applying our approach to the Open Water Analytics Stormwater Management Model project helped to identify four new needs requiring continued and more research: (1) systematic and explicit formulation of metamorphic relations, (2) metamorphic testing examples specific to the scientific software, (3) correlating metamorphic testing with regression testing, and (4) integrating metamorphic testing with build tools like CMake and continuous integration tools like GitHub Actions.

Conclusions: Integrating different empirical sources is promising for establishing a contextual understanding of software engineering practices, and for action research, such as workflow refinements and tool interventions, to be carried out in a principled manner.
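
To make the notion of an explicitly formulated metamorphic relation concrete, here is a minimal Python sketch. It is not taken from the Stormwater Management Model project; the integrator `trapezoid`, the helper `holds_scaling_relation`, and the seeded fault are all hypothetical. The relation checked is linearity: scaling the integrand by a constant c should scale the computed integral by c, which can be verified without knowing the true value of the integral.

```python
import math


def trapezoid(f, a, b, n=1000):
    """Composite trapezoid rule; stands in for the scientific code under test."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def holds_scaling_relation(integrate, f, a, b, c, tol=1e-9):
    """Metamorphic relation: integrate(c * f) should equal c * integrate(f)."""
    source_output = integrate(f, a, b)
    follow_up_output = integrate(lambda x: c * f(x), a, b)
    return math.isclose(follow_up_output, c * source_output,
                        rel_tol=tol, abs_tol=tol)


def faulty_trapezoid(f, a, b):
    """Seeded fault (hypothetical): a constant offset breaks linearity."""
    return trapezoid(f, a, b) + 0.01


print(holds_scaling_relation(trapezoid, math.sin, 0.0, math.pi, 3.0))
print(holds_scaling_relation(faulty_trapezoid, math.sin, 0.0, math.pi, 3.0))
```

The first check passes and the second fails, which illustrates why metamorphic relations suit scientific software where an exact expected output is often unavailable: the test compares two runs against each other instead of against an oracle.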

Important Experimentation Characteristics: An Expert Survey

Background: Recent empirical studies indicate that online controlled experimentation is rarely systematically applied. A structured and complete experiment definition is the basis for systematic and trustworthy experimentation. Aims: As the first step towards guidelines for the definition of experiments, we explore experimentation application types that are conducted in practice. Additionally, we identify experiment definition characteristics that experts regard as contributing considerably to trustworthy experimentation. Method: An expert survey among fifteen industrial experts who have published peer-reviewed publications was conducted. Results: In total, we identified fourteen types of applications with 32 concrete applications among the answers of the experts. The most frequently mentioned characteristics regarding the accuracy of an experiment were: success metrics, hypothesis, data quality metrics, guardrail metrics, alerting & shutdown, sizing, and segmentation. Conclusions: There are various applications for experimentation besides the ones mentioned in the literature. Most experts consider only about half of the known characteristics as relevant for the accuracy of an experiment.

Inclusion and Exclusion Criteria in Software Engineering Tertiary Studies: A Systematic Mapping and Emerging Framework

Context: Tertiary studies in software engineering (TS@SE) are widely used to synthesise evidence on a research topic systematically. As part of their protocol, TS@SE define inclusion and exclusion criteria (IC/EC) aimed at selecting those secondary studies (SS) to be included in the analysis. Aims: To provide the state of the art on the definition and application of IC/EC in TS@SE and, from the results of this analysis, to outline an emerging framework, TSICEC, to be used by SE researchers. Method: To provide the state of the art, we conducted a systematic mapping (SM) combining automatic search and snowballing over the body of SE scientific literature, which led to 50 papers after application of our own IC/EC. The extracted data was synthesised using content analysis. The results were used to define a first version of TSICEC. Results: The SM resulted in a coding schema, and a thorough analysis of the selected papers on the basis of this coding. Our TSICEC framework includes guidelines for the definition of IC/EC in TS@SE. Conclusion: This paper is a step towards establishing a foundation for researchers in two ways: as authors, by understanding the different possibilities to define IC/EC and apply them to select SS; as readers, by having an instrument to understand the methodological rigor upon which TS@SE may claim their findings.

Python Crypto Misuses in the Wild

Background: Previous studies have shown that up to 99.59 % of the Java apps using crypto APIs misuse the API at least once. However, these studies have been conducted on Java and C, while empirical studies for other languages are missing. For example, a controlled user study with crypto tasks in Python has shown that 68.5 % of the professional developers write a secure solution for a crypto task. Aims: To understand if this observation holds for real-world code, we conducted a study of crypto misuses in Python. Method: We developed a static analysis tool that covers common misuses of 5 different Python crypto APIs. With this analysis, we analyzed 895 popular Python projects from GitHub and 51 MicroPython projects for embedded devices. Further, we compared our results with the findings of previous studies. Results: Our analysis reveals that 52.26 % of the Python projects have at least one misuse. Further, some Python crypto libraries' API design helps developers avoid misusing crypto functions; such misuses were much more common in studies conducted with Java and C code. Conclusion: We conclude that we can see a positive impact of the good API design on crypto misuses for Python applications. Further, our analysis of MicroPython projects reveals the importance of hybrid analyses.
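
As a rough illustration of what a static check for a crypto misuse can look like, the Python sketch below walks a program's syntax tree and flags calls to the standard library's `hashlib.pbkdf2_hmac` that pass a constant salt or a low literal iteration count. This is not the authors' tool, and the abstract does not list which misuse rules it covers; the rule, the threshold `MIN_ITERATIONS`, and the function name `find_weak_pbkdf2_calls` are assumptions made for this example.

```python
import ast

MIN_ITERATIONS = 1000  # hypothetical threshold chosen for this sketch


def find_weak_pbkdf2_calls(source):
    """Return sorted (line, reason) pairs for suspicious pbkdf2_hmac calls.

    Only positional arguments given as literals are inspected, so this
    toy rule misses values that flow in through variables.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both hashlib.pbkdf2_hmac(...) and a bare pbkdf2_hmac(...)
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name != "pbkdf2_hmac" or len(node.args) < 4:
            continue
        salt, iterations = node.args[2], node.args[3]
        if isinstance(salt, ast.Constant) and isinstance(salt.value, bytes):
            findings.append((node.lineno, "constant salt"))
        if (isinstance(iterations, ast.Constant)
                and isinstance(iterations.value, int)
                and iterations.value < MIN_ITERATIONS):
            findings.append((node.lineno, "low iteration count"))
    return sorted(findings)


SAMPLE = (
    "import hashlib, os\n"
    "k1 = hashlib.pbkdf2_hmac('sha256', b'pw', b'salt', 10)\n"
    "k2 = hashlib.pbkdf2_hmac('sha256', b'pw', os.urandom(16), 200000)\n"
)
for line, reason in find_weak_pbkdf2_calls(SAMPLE):
    print(f"line {line}: {reason}")
```

A purely syntactic rule like this cannot see values computed at run time, which is one reason a combination of static and dynamic (hybrid) analysis can be attractive.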

Semantic Slicing of Architectural Change Commits: Towards Semantic Design Review

Software architectural changes involve more than one module or component and are complex to analyze compared to local code changes. Development teams aiming to review architectural aspects (design) of a change commit consider many essential scenarios such as access rules and restrictions on usage of program entities across modules. Moreover, design review is essential when proper architectural formulations are paramount for developing and deploying a system. Untangling architectural changes, recovering semantic design, and producing design notes are the crucial tasks of the design review process. To support these tasks, we construct a lightweight tool [4] that can detect and decompose semantic slices of a commit containing architectural instances. A semantic slice consists of a description of relational information of involved modules, their classes, methods and connected modules in a change instance, which is easy for a reviewer to understand. We extract various directory and naming structures (DANS) properties from the source code for developing our tool. Utilizing the DANS properties, our tool first detects architectural change instances based on our defined metric and then decomposes the slices (based on string processing). Our preliminary investigation with ten open-source projects (developed in Java and Kotlin) reveals that the DANS properties yield high precision and recall (93-100%) for detecting and generating architectural slices. Our proposed tool will serve as a preliminary approach for semantic design recovery and design summary generation for project releases.

Study of the Utility of Text Classification Based Software Architecture Recovery Method RELAX for Maintenance

Background. The software architecture recovery method RELAX produces a concern-based architectural view of a software system graphically and textually from that system's source code. The method has been implemented in software which can recover the architecture of systems whose source code is written in Java. Aims. Our aim was to find out whether the availability of architectural views produced by RELAX can help maintainers who are new to a project in becoming productive with development tasks sooner, and how they felt about working in such an environment. Method. We conducted a user study with nine participants. They were subjected to a controlled experiment in which maintenance success and speed with and without access to RELAX recovery results were compared to each other. Results. We have observed that employing architecture views produced by RELAX helped participants reduce time to get started on maintenance tasks by a factor of 5.38 or more. While most participants were unable to finish their tasks within the allotted time when they did not have recovery results available, all of them finished them successfully when they did. Additionally, participants reported that these views were easy to understand, helped them to learn the system's structure and enabled them to compare different versions of the system. Conclusions. Through the speedup to the start of maintenance experienced by the participants as well as in their formed opinions, RELAX has shown itself to be a valuable help that could provide the basis of further tools that specifically support the development process with a focus on maintenance.

Towards Sustainability of Systematic Literature Reviews

Background: The software engineering community has increasingly conducted systematic literature reviews (SLR) as a means to summarize evidence from different studies and bring to light the state of the art of a given research topic. While SLR provide many benefits, they also present several problems, only some of which have specific solutions. In particular, two main problems still remain: the high time-/effort-consumption nature of SLR and the lack of an effective impact of SLR results in the industry, as initially expected for SLR. Aims: The main goal of this paper is to introduce a new view, which we name Sustainability of SLR, on how to deal with SLR aiming at reducing those problems. Method: We analyzed six reference studies published in the last decade to identify, group, and analyze the SLR problems and their interconnections. Based on such analysis, we proposed the view of Sustainability of SLR that intends to address these problems. Results: The proposed view encompasses three dimensions (social, economic, and technical) that could make SLR more sustainable in the sense that the four major problems and 31 barriers (i.e., possible causes for those problems) that we identified could be mitigated. Conclusions: The view of Sustainability of SLR intends to change the researchers' mindset to mitigate the inherent SLR problems and, as a consequence, achieve sustainable SLR, i.e., those that consume less time/effort to be conducted and updated with useful results for the industry.

Towards a Methodology for Participant Selection in Software Engineering Experiments: A Vision of the Future

Background. Software Engineering (SE) researchers extensively perform experiments with human subjects. Well-defined samples are required to ensure external validity. Samples are selected purposely or by convenience, limiting the generalizability of results. Objective. We aim to depict the current status of participants selection in empirical SE, identifying the main threats and how they are mitigated. We draft a robust approach to participants' selection. Method. We reviewed existing participants' selection guidelines in SE, and performed a preliminary literature review to find out how participants' selection is conducted in SE in practice. Results. We outline a new selection methodology, by 1) defining the characteristics of the desired population, 2) locating possible sources of sampling available for researchers, and 3) identifying and reducing the "distance" between the selected sample and its corresponding population. Conclusion. We propose a roadmap to develop and empirically validate the selection methodology.

Vision for an Artefact-based Approach to Regulatory Requirements Engineering

Background: Nowadays, regulatory requirements engineering (regulatory RE) faces challenges of interdisciplinary nature that cannot be tackled due to existing research gaps. Aims: We envision an approach to solve some of the challenges related to the nature and complexity of regulatory requirements, the necessity for domain knowledge, and the involvement of legal experts in regulatory RE. Method: We suggest the qualitative analysis of regulatory texts combined with a further case study to develop an empirical foundation for our research. Results: We outline our vision for the application of extended artefact-based modeling for regulatory RE. Conclusions: Empirical methodology is an essential instrument to address interdisciplinarity and complexity in regulatory RE. Artefact-based modeling supported by empirical results can solve a particular set of problems while not limiting the application of other methods and tools and facilitating the interaction between different fields of practice and research.

Web Application Testing: Using Tree Kernels to Detect Near-duplicate States in Automated Model Inference

Background: In the context of End-to-End testing of web applications, automated exploration techniques (a.k.a. crawling) are widely used to infer state-based models of the site under test. These models, in which states represent features of the web application and transitions represent reachability relationships, can be used for several model-based testing tasks, such as test case generation. However, current exploration techniques often lead to models containing many near-duplicate states, i.e., states representing slightly different pages that are in fact instances of the same feature. This has a negative impact on the subsequent model-based testing tasks, adversely affecting, for example, size, running time, and achieved coverage of generated test suites. Aims: As a web page can be naturally represented by its tree-structured DOM representation, we propose a novel near-duplicate detection technique to improve the model inference of web applications, based on Tree Kernel (TK) functions. TKs are a class of functions that compute similarity between tree-structured objects, largely investigated and successfully applied in the Natural Language Processing domain. Method: To evaluate the capability of the proposed approach in detecting near-duplicate web pages, we conducted preliminary classification experiments on a freely-available massive dataset of about 100k manually annotated web page pairs. We compared the classification performance of the proposed approach with other state-of-the-art near-duplicate detection techniques. Results: Preliminary results show that our approach performs better than state-of-the-art techniques in the near-duplicate detection classification task. Conclusions: These promising results show that TKs can be applied to near-duplicate detection in the context of web application model inference, and motivate further research in this direction to assess the impact of the technique on the quality of the inferred models and on the subsequent application of model-based testing techniques.

Why Some Bug-bounty Vulnerability Reports are Invalid?: Study of bug-bounty reports and developing an out-of-scope taxonomy model

Background: Despite the increasing popularity of bug-bounty platforms in industry, little empirical evidence exists to identify the nature of invalid vulnerability reports. Mitigation of invalid reports is a serious concern of organisations running or using bug-bounty platforms as well as security researchers. Aims: In this work we aim to identify: (i) why some reports are considered invalid, and (ii) what the characteristics are of reports considered invalid due to being out-of-scope. Method: We conducted an empirical study on disclosed invalid reports in HackerOne to examine the reasons these reports are marked as invalid and we found that out-of-scope is the leading reason. Since all out-of-scope reports were rejected according to the program's policy page, we studied all programs' policy pages in two major bug-bounty platforms to understand the characteristics of an out-of-scope report. We developed a generalised out-of-scope taxonomy model and we used our model to further analyse HackerOne out-of-scope reports to find the leading attributes of this model that contribute to the fate of these reports. Results: We identified out-of-scope followed by false-positive as the two main reasons for a report to be deemed invalid. We found that the attribute of vulnerability type in our taxonomy model is the leading characteristic of out-of-scope reports. We also identified the top 9 out-of-interest vulnerability types according to policy pages. Conclusions: Our study can help bug-bounty platforms and researchers to better understand the nature of invalid reports. Our finding about the importance of vulnerability type in validating reports can be used to justify future work to develop automated classification techniques based on vulnerability types to better triage invalid reports. Our top 9 out-of-interest vulnerability types can be used as a blacklist to automatically flag a possibly out-of-scope report. Finally, our generalised out-of-scope taxonomy model can guide organisations as a base model to create their policy page and tailor it as they need.