ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

SESSION: Technical Papers

A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Background: Meeting the growing industry demand for Data Science requires cross-disciplinary teams that can translate machine learning research into production-ready code. Software engineering teams value adherence to coding standards as an indication of code readability, maintainability, and developer expertise. However, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study investigates the extent to which Data Science projects follow coding standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? Method: We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity. Results: Data Science projects suffer from a significantly higher rate of functions that use an excessive number of parameters and local variables. Data Science projects also follow different variable naming conventions to non-Data Science projects. Conclusions: The differences indicate that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions. Our conjecture is that this may be because traditional software engineering conventions are inappropriate in the context of Data Science projects.
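
To illustrate the kind of conformance checks involved, the sketch below (ours, not the study's tooling) uses Python's ast module to flag functions with too many parameters or local variables, two of the rules discussed above; the thresholds are illustrative and merely resemble common linter defaults.

```python
import ast

MAX_ARGS, MAX_LOCALS = 5, 15   # illustrative thresholds, similar to common linter defaults

def check_functions(source: str):
    """Flag functions with too many parameters or too many local variables."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            n_args = len(node.args.args)
            local_names = {target.id
                           for stmt in ast.walk(node) if isinstance(stmt, ast.Assign)
                           for target in stmt.targets if isinstance(target, ast.Name)}
            if n_args > MAX_ARGS:
                findings.append(f"{node.name}: {n_args} parameters")
            if len(local_names) > MAX_LOCALS:
                findings.append(f"{node.name}: {len(local_names)} local variables")
    return findings

snippet = "def train(model, data, lr, epochs, batch_size, seed, verbose):\n    return None\n"
print(check_functions(snippet))   # -> ['train: 7 parameters']
```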

A Survey on the Interplay between Software Engineering and Systems Engineering during SoS Architecting

Background: The Systems Engineering and Software Engineering disciplines are highly intertwined in most modern Systems of Systems (SoS), and particularly so in industries such as defense, transportation, energy and health care. However, the combination of these disciplines during the architecting of SoS seems to be especially challenging; the literature suggests that major integration and operational issues are often linked to ambiguities and gaps between system-level and software-level architectures.

Aims: The objective of this paper is to empirically investigate: 1) the state of practice on the interplay between these two disciplines in the architecting process of systems with SoS characteristics; 2) the problems perceived due to this interplay during said architecting process; and 3) the problems arising due to the particular characteristics of SoS systems.

Method: We conducted a questionnaire-based online survey among practitioners from industries in the aforementioned domains, with a background in Systems Engineering, Software Engineering or both, and experience in the architecting of systems with SoS characteristics. The survey combined multiple-choice and open-ended questions, and the data collected from the 60 respondents were analyzed using quantitative and qualitative methods.

Results: We found that although in most cases the software architecting process is governed by system-level requirements, the way requirements were specified by systems engineers, and the lack of domain knowledge of software engineers, often led to misinterpretations at the software level. Furthermore, we found that unclear and/or incomplete specifications could be a common cause of technical debt in SoS projects, caused in part by insufficient interface definitions. It also appears that while the SoS concept has been adopted by some practitioners in the field, the same is not true of the existing and growing body of knowledge on the subject in Software Engineering, resulting in recurring problems with system integration. Finally, while not directly related to the interplay of the two disciplines, the survey also indicates that low-level hardware components, despite being identified as the root cause of undesired emergent behavior, are often not considered when modeling or simulating the system.

Conclusions: The survey indicates the need for tighter collaboration between the two disciplines, structured around concrete guidelines and practices for reconciling their differences. A number of open issues identified by this study require further investigation.

Adoption and Effects of Software Engineering Best Practices in Machine Learning

Background. The increasing reliance on applications with machine learning (ML) components calls for mature engineering techniques that ensure these are built in a robust and future-proof manner.

Aim. We aim to empirically determine the state of the art in how teams develop, deploy and maintain software with ML components.

Method. We mined both academic and grey literature and identified 29 engineering best practices for ML applications. We conducted a survey among 313 practitioners to determine the degree of adoption for these practices and to validate their perceived effects. Using the survey responses, we quantified practice adoption, differentiated along demographic characteristics, such as geography or team size. We also tested correlations and investigated linear and non-linear relationships between practices and their perceived effect using various statistical models.

Results. Our findings indicate, for example, that larger teams tend to adopt more practices, and that traditional software engineering practices tend to have lower adoption than ML specific practices. Also, the statistical models can accurately predict perceived effects such as agility, software quality and traceability, from the degree of adoption for specific sets of practices. Combining practice adoption rates with practice importance, as revealed by statistical models, we identify practices that are important but have low adoption, as well as practices that are widely adopted but are less important for the effects we studied.

Conclusion. Overall, our survey and the analysis of responses received provide a quantitative basis for assessment and step-wise improvement of practice adoption by ML teams.

An Empirical Study of Software Exceptions in the Field using Search Logs

Background: Software engineers spend a substantial amount of time using Web search to accomplish software engineering tasks. Such search tasks include finding code snippets, API documentation, seeking help with debugging, etc. While debugging a bug or crash, one common practice of software engineers is to search the internet for information about the associated error or exception traces. Aims: In this paper, we analyze query logs from Bing to carry out a large-scale study of software exceptions. To the best of our knowledge, this is the first large-scale study to analyze how Web search is used to find information about exceptions. Method: We analyzed about 1 million exception-related search queries from a random sample of 5 billion web search queries. To extract exceptions from unstructured query text, we built a novel machine learning model. With the model, we extracted exceptions from raw queries and performed popularity, effort, success, query characteristic and web domain analyses. We also performed programming language-specific analysis to give a better view of exception search behavior. Results: Using the model, which has an F1-score of 0.82, our study identifies the most frequent, most effort-intensive, and least successful exceptions, as well as the popularity of community Q&A sites. Conclusion: These techniques can help improve existing methods, documentation and tools for exception analysis and prediction. Further, similar techniques can be applied to APIs, frameworks, etc.

An Empirical Validation of Cognitive Complexity as a Measure of Source Code Understandability

Background: Developers spend a lot of their time on understanding source code. Static code analysis tools can draw attention to code that is difficult for developers to understand. However, most of the findings are based on non-validated metrics, which can lead to confusion and to hard-to-understand code going unidentified.

Aims: In this work, we validate a metric called Cognitive Complexity which was explicitly designed to measure code understandability and which is already widely used due to its integration in well-known static code analysis tools.

Method: We conducted a systematic literature search to obtain data sets from studies which measured code understandability. This way we obtained about 24,000 understandability evaluations of 427 code snippets. We calculated the correlations of these measurements with the corresponding metric values and statistically summarized the correlation coefficients through a meta-analysis.

Results: Cognitive Complexity positively correlates with comprehension time and subjective ratings of understandability. The metric showed mixed results for the correlation with the correctness of comprehension tasks and with physiological measures.

Conclusions: Cognitive Complexity is the first validated and solely code-based metric that is able to reflect at least some aspects of code understandability. Moreover, due to its methodology, this work shows that code understanding is currently measured in many different ways whose relationships to one another are unknown. This makes it difficult to compare the results of individual studies as well as to develop a metric that measures code understanding in all its facets.

Case Survey Studies in Software Engineering Research

Background: Given the social aspects of Software Engineering (SE), in the last twenty years, researchers from the field started using research methods common in social sciences such as case study, ethnography, and grounded theory. More recently, case survey, another imported research method, has seen increasing use in SE studies. It is based on existing case studies reported in the literature and intends to harness the generalizability of surveys and the depth of case studies. However, little is known about how the case survey method has been applied in SE research, let alone guidelines on how to employ it properly. Aims: This article aims to provide a better understanding of how the case survey method has been applied in Software Engineering research. Method: To address this knowledge gap, we performed a systematic mapping study and analyzed 12 Software Engineering studies that used the case survey method. Results: Our findings show that these studies presented a heterogeneous understanding of the approach, ranging from secondary studies to primary inquiries focused on a large number of instances of a research phenomenon. They have not applied the case survey method consistently as defined in the seminal methodological papers. Conclusions: We conclude that a set of clearly defined guidelines is needed on how to use the case survey method in SE research, to ensure the quality of the studies employing this approach and to provide a set of clearly defined criteria to evaluate such work.

Challenges in Docker Development: A Large-scale Study Using Stack Overflow

Background: Docker technology has been increasingly used among software developers in a multitude of projects. This growing interest is due to the fact that Docker technology supports a convenient process for creating and building containers, promoting close cooperation between development and operations teams, and enabling continuous software delivery. As a fast-growing technology, it is important to identify the Docker-related topics that are most popular as well as the existing challenges and difficulties that developers face.

Aims: This paper presents a large-scale empirical study identifying practitioners' perspectives on Docker technology by mining posts from the Stack Overflow (SoF) community. Method: A dataset of 113,922 Docker-related posts was created based on a set of relevant tags and contents. The dataset was cleaned and prepared. Topic modelling was conducted using Latent Dirichlet Allocation (LDA), allowing the identification of dominant topics in the domain. Results: Our results show that most developers use SoF to ask about a broad spectrum of Docker topics including framework development, application deployment, continuous integration, web-server configuration and many more. We determined that the 30 topics that developers discuss can be grouped into 13 main categories. Most of the posts belong to the categories of application development, configuration, and networking. On the other hand, we find that posts on monitoring status, transferring data, and authenticating users are more popular among developers compared to the other topics. Specifically, developers face challenges with web browser issues, networking errors and memory management. In addition, there is a lack of experts in this domain. Conclusion: Our research findings will guide future work on the development of new tools and techniques, helping the community to focus efforts and understand existing trade-offs on Docker topics.
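
For readers unfamiliar with the technique, the sketch below shows how LDA topic modelling over post bodies could be run with scikit-learn; the toy posts and the preprocessing are assumptions of this example and only loosely echo the paper's setup (the study itself identified 30 topics).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [   # stand-in for the cleaned bodies of Docker-related posts
    "docker compose fails to start the web server container on port 8080",
    "how to mount a volume in a dockerfile for continuous integration builds",
    "kubernetes vs docker swarm networking configuration for microservices",
]
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=5, random_state=0)  # the study found 30 topics
lda.fit(doc_term)

# Print the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```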

Characterizing Energy Consumption of Third-Party API Libraries using API Utilization Profiles

Background: Third-party software libraries often serve as fundamental building blocks for developing applications. However, depending on such libraries for development raises a new concern, energy consumption, which has become of increased interest for the software engineering community in recent years. Understanding the energy implications of software design choices is an ongoing research challenge, especially when working with mobile devices that are constrained by limited battery life. Aims: Our goal is to research approaches which will support software developers to better comprehend the energy implications of software design choices on Android. For this study, we particularly focus on APIs from third-party libraries, for which we research methods that enable estimating the energy consumption without the need for specific measurement hardware and laborious measurements. Method: To achieve the stipulated goal, we introduce System API Utilization Profiles (uAPI), which are based on the general assumption that the actual energy consumption of a library is directly tied to its utilization of the underlying System API. We provide a formal definition and implementation of the proposed uAPI profiles, which are calculated based on dynamic call graphs obtained from a library under test. To further show the connection between our proposed uAPI profiles and energy consumption, we empirically examined their correlation using two experiments: the first is dedicated to Android I/O operations, and the second examines uAPI profiles based on a popular open-source library for JSON document processing. Results: In our empirical evaluation, we collected 1052 individual call graphs which we used as input to evaluate our model and compare it with the actual measured energy consumption. For each call graph, we measured the energy consumption with special hardware and attributed the measurements to individual methods. Based on that data, we observed a strong linear correlation between the proposed uAPI profiles and the actual measured energy consumption. Conclusions: uAPI profiles serve as a lightweight and feasible approach to characterize the energy characteristics of third-party libraries. Their computation does not involve any special hardware, and no alterations on the target mobile device are required. For future work, we will investigate possibilities of using the proposed uAPI profiles as the foundation for a regression model, which would allow us to predict the evolution of a library's energy consumption over time.
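
A minimal sketch of the underlying idea, with made-up numbers rather than the paper's measurements: represent each dynamic call graph by counts of calls into selected System APIs and check how strongly that profile correlates with measured energy. The API categories, weights, and figures below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

# One row per dynamic call graph: counts of calls into selected System APIs.
api_calls = np.array([
    #  file I/O, network, crypto
    [120,  10,   0],
    [300,  40,   5],
    [ 80,   5,   0],
    [500, 120,  20],
])
energy_joules = np.array([1.8, 4.1, 1.1, 7.3])   # energy measured per run (hardware meter)

# A simple scalar profile: weighted sum of API calls; here all weights are 1,
# but per-API weights could come from micro-benchmarks.
profile = api_calls.sum(axis=1)
r, p = pearsonr(profile, energy_joules)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```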

DevOps in an ISO 13485 Regulated Environment: A Multivocal Literature Review

Background: Medical device development projects must follow proper directives and regulations to be able to market and sell the end-product in their respective territories. The regulations describe requirements that seem to be at odds with efficient software development and short time-to-market. As agile approaches, like DevOps, are becoming more and more popular in the software industry, a discrepancy between these modern methods and traditional regulated development has been reported. Although examples of successful adoption in this context exist, the research is sparse. Aims: The objective of this study is twofold: to review the current state of DevOps adoption in a regulated medical device environment; and to propose a checklist based on that review for introducing DevOps in that context. Method: A multivocal literature review is performed and evidence is synthesized from sources published between 2015 and March 2020 to capture the opinions of experts and the community in this field. Results: Our findings reveal that adoption of DevOps in a regulated medical device environment such as ISO 13485 has its challenges, but the potential benefits may outweigh them in regulatory, compliance, security, organizational and technical areas. Conclusion: DevOps for regulated medical device environments is a highly appealing approach compared to traditional methods and could be particularly suited for regulated medical development. However, an organization must properly anchor a transition to DevOps in top-level management and be supportive in the initial phase by providing professional coaching and space for iterative learning, as such an initiative is a complex organizational and technical task.

funcGNN: A Graph Neural Network Approach to Program Similarity

Background: Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG) is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. Aim: This study intends to examine the effectiveness of graph neural networks to estimate program similarity by analysing the associated control flow graphs. Method: We introduce funcGNN, a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied to labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN in estimating the GED between programs: our experimental analysis shows that it achieves a lower error rate (1.94 × 10^-3) than state-of-the-art methods while being 23 times faster than the quickest traditional GED approximation method and scaling better. Conclusion: funcGNN possesses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification), thus opening multiple research directions.
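
To make the general architecture concrete, here is a minimal Siamese graph-network sketch in PyTorch that regresses a GED label for a pair of CFGs; funcGNN's actual layers, node features, and training setup differ, so treat this purely as an illustration of the idea.

```python
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    """Two rounds of neighbor averaging followed by mean pooling into a graph embedding."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, hid_dim)

    def forward(self, adj, feats):
        # adj: (n, n) adjacency with self-loops; feats: (n, in_dim) node features
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.lin1(adj @ feats / deg))
        h = torch.relu(self.lin2(adj @ h / deg))
        return h.mean(dim=0)                       # graph-level embedding

class GedRegressor(nn.Module):
    """Embed both CFGs with the same encoder, then regress their edit distance."""
    def __init__(self, in_dim=8, hid_dim=32):
        super().__init__()
        self.encoder = TinyGCN(in_dim, hid_dim)
        self.head = nn.Sequential(nn.Linear(2 * hid_dim, 32), nn.ReLU(),
                                  nn.Linear(32, 1))

    def forward(self, g1, g2):
        e1, e2 = self.encoder(*g1), self.encoder(*g2)   # g = (adj, feats)
        return self.head(torch.cat([e1, e2]))

model = GedRegressor()
adj1, feats1 = torch.eye(3), torch.randn(3, 8)          # toy CFG 1 (self-loops only)
adj2, feats2 = torch.eye(4), torch.randn(4, 8)          # toy CFG 2
pred_ged = model((adj1, feats1), (adj2, feats2))
# Training would minimize, e.g., MSE between pred_ged and the labeled GED of the pair.
```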

Effect of Technical and Social Factors on Pull Request Quality for the NPM Ecosystem

Background: Pull request (PR) based development, which is a norm for social coding platforms, entails the challenge of evaluating the contributions of, often unfamiliar, developers from across the open source ecosystem and, conversely, submitting a contribution to a project with unfamiliar maintainers. Previous studies suggest that the decision of accepting or rejecting a PR may be influenced by a diverging set of technical and social factors, but often focus on relatively few projects, do not consider ecosystem-wide measures, or ignore the possible non-monotonic relationships between the predictors and PR acceptance probability. Aim: We aim to shed light on this important decision-making process by testing which measures significantly affect the probability of PR acceptance on a significant fraction of a large ecosystem, ranking them by their relative importance in predicting PR acceptance, and determining the shape of the functions that map each predictor to PR acceptance. Method: We proposed seven hypotheses regarding which technical and social factors might affect PR acceptance and created 17 measures based on them. Our dataset consisted of 470,925 PRs from 3349 popular NPM packages and the 79,128 GitHub users who created them. We tested which of the measures affect PR acceptance and ranked the significant measures by their importance in a predictive model. Results: Our predictive model had an AUC of 0.94, and 15 of the 17 measures were found to matter, including five novel ecosystem-wide measures. Measures describing the number of PRs submitted to a repository and what fraction of those get accepted, as well as signals about the PR review phase, were most significant. We also discovered that only four predictors have a linear influence on the PR acceptance probability, while others showed a more complicated response. Conclusion: Our findings should be helpful for PR creators, integrators, as well as tool designers to focus on the important factors affecting PR acceptance.

Learning Features that Predict Developer Responses for iOS App Store Reviews

Background: Which aspects of an iOS App Store user review motivate developers to respond? Numerous studies have been conducted to extract useful information from reviews, but limited effort has been expended to answer this question.

Aims: This work aims to investigate the potential of using a machine learning algorithm and the features that can be extracted from user reviews to model developers' response behavior. Through this process, we want to uncover the learned relationship between these features and developer responses.

Method: For our prediction, we run a random forest algorithm over the derived features. We then perform a feature importance analysis to understand the relative importance of each individual feature and groups thereof.
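
A minimal sketch of this step, with invented feature names and toy rows standing in for the mined review features: fit a random forest to predict whether a review receives a developer response, then inspect the impurity-based feature importances.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

reviews = pd.DataFrame({              # toy stand-in for features mined from app reviews
    "rating":        [1, 5, 2, 4, 1, 3],
    "review_length": [120, 15, 200, 40, 90, 60],
    "sentiment":     [-0.6, 0.8, -0.4, 0.3, -0.7, 0.1],
    "responded":     [1, 0, 1, 0, 1, 0],   # label: did the developer respond?
})
X, y = reviews.drop(columns="responded"), reviews["responded"]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance, mirroring a feature importance analysis.
for name, score in sorted(zip(X.columns, clf.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```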

Results: Through a case study of eight popular apps, we show patterns in developers' response behavior. Our results demonstrate not only that rating and review length are among the most important features but also that review posting time, sentiment, and writing style play an important role in response prediction. Additionally, the variation in feature importance rankings implies that different app developers use different feature weights when prioritizing responses.

Conclusions: Our results may provide guidance for those building review or response prioritization tools and developers wishing to prioritize their responses effectively.

Long-Term Evaluation of Technical Debt in Open-Source Software

Background: A consistent body of research and practice has identified that technical debt provides valuable and actionable insight into the design and implementation deficiencies of complex software systems. Existing software tools enable characterizing and measuring the amount of technical debt at selective granularity levels; by providing a computational model, they enable stakeholders to measure and ultimately control this phenomenon. Aims: In this paper we aim to study the evolution and characteristics of technical debt in open-source software. For this, we carry out a longitudinal study that covers the entire development history of several complex applications. The goal is to improve our understanding of how the amount and composition of technical debt change in evolving software. We also study how new technical debt is introduced in software, as well as identify how developers handle its accumulation over the long term. Method: We carried out our evaluation using three complex, open-source Java applications. All 110 released versions, covering more than 10 years of development history for each application, were analyzed using SonarQube. We studied how the amount, composition and history of technical debt changed during development, compared our results across the studied applications and present our most important findings. Results: For each application, we identified key versions during which large amounts of technical debt were added, removed or both. This had significantly more impact than the lines-of-code or class-count increases that generally occurred during development. However, within each version, we found a high correlation between file lines of code and technical debt. We observed that the Pareto principle was satisfied for the studied applications, as 20% of issue types generated around 80% of the total technical debt. Interestingly, there was a large degree of overlap between the issues that generated most of the debt across the studied applications. Conclusions: Early application versions showed greater fluctuation in the amount of existing technical debt. We found application size to be an unreliable predictor for the quantity of technical debt. Most debt was introduced in applications as part of milestone releases that expanded their feature set; likewise, we identified releases where extensive refactoring significantly reduced the level of debt. We also discovered that technical debt issues persist for a long time in source code, and their removal did not appear to be prioritized according to type or severity.
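
To illustrate how a Pareto observation of this kind can be checked, the short sketch below aggregates per-issue remediation effort by rule and reports the debt share of the top 20% of issue types; the column names and rows are illustrative stand-ins for a SonarQube issue export, not the study's data.

```python
import pandas as pd

issues = pd.DataFrame({                      # illustrative stand-in data
    "rule": ["S1192", "S1192", "S3776", "S1118", "S3776", "S1068"],
    "debt_minutes": [10, 10, 60, 5, 45, 15],
})
by_rule = issues.groupby("rule")["debt_minutes"].sum().sort_values(ascending=False)
top_20pct = max(1, int(round(0.2 * len(by_rule))))
share = by_rule.head(top_20pct).sum() / by_rule.sum()
print(f"Top 20% of issue types account for {share:.0%} of the debt")
```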

On Reducing the Energy Consumption of Software: From Hurdles to Requirements

Background. As software has taken control over hardware in many domains, the energy footprint induced by software is becoming a critical question for our society, since the resources powering the underlying infrastructure are finite. Yet, beyond this growing interest, energy consumption remains a difficult concept for developers to master.

Aims. The purpose of this study is to better understand the root causes that prevent the issue of software energy consumption from being more widely considered by developers and companies.

Method. To investigate this issue, this paper reports on a qualitative study we conducted in an industrial context. We applied an in-depth analysis of the interviews of 10 experienced developers and summarized a set of implications.

Results. We argue that our study delivers i) insightful feedback on how green software design is considered among the interviewed developers and ii) a set of findings to build helpful tools, motivate further research, and establish better development strategies to promote green software design.

Conclusion. This paper covers an industrial case study of developers' awareness of green software design and how to promote it within the company. While it might not be generalizable to every company, we believe our results deliver a common body of knowledge with implications to be considered for similar cases and further research.

On the adoption, usage and evolution of Kotlin features in Android development

Background: Google announced Kotlin as an official Android programming language in 2017, giving developers the option of writing applications in a language that combines object-oriented and functional features. Aims: The goal of this work is to understand the usage of Kotlin features considering four aspects: i) which features are adopted, ii) what is the degree of adoption, iii) when are these features added to Android applications for the first time, and iv) how the usage of features evolves along with the applications' evolution. Method: Exploring the source code of 387 Android applications, we identify the usage of Kotlin features in each application version and compute the moment at which each feature is used for the first time. Finally, we identify the evolution trend that best describes the usage of these features. Results: 15 out of 26 features are used in at least 50% of the applications. Moreover, we found that type inference, lambda and safe call are the most used features. Also, we observed that the most used Kotlin features are those first included in Android applications. Finally, we report that the majority of applications tend to add more instances of 24 out of 26 features as they evolve. Conclusions: Our study generates 7 main findings. We present their implications, which are addressed to developers, researchers and tool builders in order to foster the use of Kotlin features to develop Android applications.

Perf-AL: Performance Prediction for Configurable Software through Adversarial Learning

Context: Many software systems are highly configurable. Different configuration options can lead to varying performance of the system. It is difficult to measure system performance in the presence of an exponential number of possible combinations of these options.

Goal: Predicting software performance by using a small configuration sample.

Method: This paper proposes Perf-AL to address this problem via adversarial learning. Specifically, we use a generative network combined with several different regularization techniques (L1 regularization, L2 regularization and a dropout technique) to output predicted values as close to the ground truth labels as possible. With the use of adversarial learning, our network identifies and distinguishes the predicted values of the generator network from the ground truth value distribution. The generator and the discriminator compete with each other by refining the prediction model iteratively until its predicted values converge towards the ground truth distribution.
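
A minimal PyTorch sketch of the adversarial setup described (ours, not the authors' Perf-AL implementation): a generator maps configuration vectors to predicted performance, a discriminator tries to tell real (configuration, performance) pairs from generated ones, and the generator combines the adversarial loss with an MSE term plus L1/L2 weight penalties and dropout. Network sizes and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a configuration vector to a predicted performance value."""
    def __init__(self, n_options):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_options, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 1),
        )
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Classifies (configuration, performance) pairs as real or generated."""
    def __init__(self, n_options):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_options + 1, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

def train_step(gen, disc, opt_g, opt_d, x, y, l1=1e-4, l2=1e-4):
    bce = nn.BCELoss()
    ones, zeros = torch.ones(len(x), 1), torch.zeros(len(x), 1)
    # Discriminator: real pairs vs. pairs with generated performance values.
    opt_d.zero_grad()
    d_loss = bce(disc(x, y), ones) + bce(disc(x, gen(x).detach()), zeros)
    d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator while staying close to the ground truth,
    # with L1/L2 penalties on its weights (echoing the regularization mentioned above).
    opt_g.zero_grad()
    y_hat = gen(x)
    reg = sum(l1 * p.abs().sum() + l2 * (p ** 2).sum() for p in gen.parameters())
    g_loss = bce(disc(x, y_hat), ones) + nn.functional.mse_loss(y_hat, y) + reg
    g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Usage sketch: gen, disc = Generator(10), Discriminator(10);
# opt_g, opt_d = torch.optim.Adam(gen.parameters()), torch.optim.Adam(disc.parameters());
# then call train_step(gen, disc, opt_g, opt_d, x_batch, y_batch) per mini-batch.
```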

Results: We argue that (i) the proposed method can achieve the same level of prediction accuracy as existing approaches, but with a smaller number of training samples, and (ii) experiments on seven real-world datasets show that our approach outperforms state-of-the-art methods. This helps to further promote configurable software performance prediction.

Conclusion: Experimental results on seven public real-world datasets demonstrate that PERF-AL outperforms state-of-the-art software performance prediction methods.

Quest for the Golden Approach: An Experimental Evaluation of Duplicate Crowdtesting Reports Detection

Background: Given the invisibility and unpredictability of distributed crowdtesting processes, there is a large number of duplicate reports, and detecting these duplicate reports is an important task to help save testing effort. Although many approaches have been proposed to automatically detect duplicates, the comparison among them and practical guidelines for adopting these approaches in crowdtesting remain vague.

Aims: We aim at conducting the first experimental evaluation of the commonly-used and state-of-the-art approaches for duplicate detection in crowdtesting reports, and exploring which is the golden approach.

Method: We begin with a systematic review of approaches for duplicate detection, and select ten state-of-the-art approaches for our experimental evaluation. We conduct duplicate detection with each approach on 414 crowdtesting projects with 59,289 reports collected from one of the largest crowdtesting platforms.

Results: The machine learning based approach, i.e., ML-REP, and the deep learning based approach, i.e., DL-BiMPM, are the two best approaches for duplicate report detection in crowdtesting, while the latter is more sensitive to the size of the training data and more time-consuming for model training and prediction.

Conclusions: This paper provides new insights and guidelines to select appropriate duplicate detection techniques for duplicate crowdtesting reports detection.

Spectrum-Based Log Diagnosis

Background: Continuous Engineering practices are increasingly adopted in modern software development. However, a frequently reported need is for more effective methods to analyze the massive amounts of data resulting from the numerous build and test runs. Aims: We present and evaluate Spectrum-Based Log Diagnosis (SBLD), a method to help developers quickly diagnose problems found in complex integration and deployment runs. Inspired by Spectrum-Based Fault Localization, SBLD leverages the differences in event occurrences between logs for failing and passing runs to highlight events that are more strongly associated with failing runs.

Method: Using data provided by Cisco Norway, we empirically investigate the following questions: (i) How well does SBLD reduce the effort needed to identify all failure-relevant events in the log for a failing run? (ii) How is the performance of SBLD affected by available data? (iii) How does SBLD compare to searching for simple textual patterns that often occur in failure-relevant events? We answer (i) and (ii) using summary statistics and heatmap visualizations, and for (iii) we compare three configurations of SBLD (with, respectively, minimum, median and maximum data) against a textual search using Wilcoxon signed-rank tests and the Vargha-Delaney measure of stochastic superiority.
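
To illustrate the core spectrum-based idea (not SBLD's exact scoring), the sketch below ranks abstracted log events by an Ochiai-like score, as used in spectrum-based fault localization, which rewards events that occur mostly in failing runs; the event strings are invented.

```python
from collections import Counter
from math import sqrt

def score_events(failing_logs, passing_logs):
    """Each log is a set of (abstracted) event signatures; higher score = more suspicious."""
    fail = Counter(e for log in failing_logs for e in log)
    pas = Counter(e for log in passing_logs for e in log)
    n_fail = len(failing_logs)
    scores = {}
    for event in set(fail) | set(pas):
        ef, ep = fail[event], pas[event]          # failing / passing runs containing the event
        scores[event] = ef / sqrt(n_fail * (ef + ep)) if ef else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

failing = [{"timeout on node A", "retry exhausted"}, {"timeout on node A"}]
passing = [{"retry exhausted"}, {"build ok"}]
for event, s in score_events(failing, passing):
    print(f"{s:.2f}  {event}")
```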

Results: Our evaluation shows that (i) SBLD achieves a significant effort reduction for the dataset used, (ii) SBLD benefits from additional logs for passing runs in general, and it benefits from additional logs for failing runs when there is a proportional amount of logs for passing runs in the data. Finally, (iii) SBLD and textual search are roughly equally effective at effort-reduction, while textual search has slightly better recall. We investigate the cause, and discuss how it is due to characteristics of a specific part of our data.

Conclusions: We conclude that SBLD shows promise as a method for diagnosing failing runs, that its performance is positively affected by additional data, but that it does not outperform textual search on the dataset considered. Future work includes investigating SBLD's generalizability on additional datasets.

Study on Patterns and Effect of Task Diversity in Software Crowdsourcing

Context: The success of software crowdsourcing depends on steady pools of task demand and active worker supply. Existing analysis reveals an average task failure ratio of 15.7% in the software crowdsourcing market.

Goal: The objective of this study is to empirically investigate patterns and effects of task diversity in software crowdsourcing platforms in order to improve the success and efficiency of software crowdsourcing.

Method: We first propose a conceptual task diversity model and develop an approach to measuring and analyzing task diversity. More specifically, task diversity is characterized based on semantic similarity and dynamic competition level, and the analysis includes identifying the dominant attributes distinguishing the competition levels and measuring the impact of task diversity on task success and worker performance on the crowdsourcing platform. The empirical study is conducted on more than one year of real-world data from TopCoder, one of the leading software crowdsourcing platforms.

Results: We identified that monetary prize and task complexity are the dominant attributes that differentiate among different competition levels. Based on these dominant attributes, we identified three task diversity patterns (configurations) from the workers' behavior perspective: responsive-to-prize, responsive-to-prize-and-complexity and over-responsive-to-prize. This study shows that the second pattern, i.e. the responsive-to-prize-and-complexity configuration, is associated with the lowest task failure ratio.

Conclusions: These findings are helpful for task requesters to plan for and improve task success in a more effective and efficient manner on software crowdsourcing platforms.

Teamwork Quality and Team Success in Software Development: A Non-exact Replication Study

Background: Teamwork is a central component of any software development organization. Therefore, the assessment of teamwork quality is important for team management in practice. TWQ (Teamwork Quality) is a measure of the quality of intra-team interactions developed specifically for software development teams. Aims: To perform a differentiated replication of previous studies on TWQ to expand the contexts in which TWQ has been applied, to refine the measurement instrument, increasing its reliability and validity, and to identify possibilities for future research on software teams. Method: We performed a cross-sectional survey collecting data from all 18 teams in one single software organization and from all members of each team, totaling 123 participants. Results: First, we refined the measurement instrument, achieving a more reliable and parsimonious instrument. Second, our results confirmed findings from previous studies when individual-level data was used, but almost no confirmation was found when data was aggregated at the team level. Conclusions: Our results partially supported previous studies but raised questions about the validity of aggregating individual data to team-level measures of the studied constructs.

TopFilter: An Approach to Recommend Relevant GitHub Topics

Background: In the context of software development, GitHub has been at the forefront of platforms to store, analyze and maintain a large number of software repositories. Topics have been introduced by GitHub as an effective method to annotate stored repositories. However, labeling GitHub repositories should be carefully conducted to avoid adverse effects on project popularity and reachability. Aims: We present TopFilter, a novel approach to assist open source software developers in selecting suitable topics for the GitHub repositories being created. Method: We built a project-topic matrix and applied a syntactic-based similarity function to recommend missing topics by representing repositories and related topics in a graph. A ten-fold cross-validation methodology has been used to assess the performance of TopFilter by considering different metrics, i.e., success rate, precision, recall, and catalog coverage. Results: The results show that TopFilter recommends good topics depending on different factors, i.e., collaborative filtering settings, considered datasets, and pre-processing activities. Moreover, TopFilter can be combined with a state-of-the-art topic recommender system (i.e., the MNB network) to improve the overall prediction performance. Conclusion: Our results confirm that collaborative filtering techniques can successfully be used to provide relevant topics for GitHub repositories. Moreover, TopFilter can gain a significant boost in prediction performance by employing the outcomes obtained by the MNB network as its initial set of topics.
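
A minimal sketch of the collaborative-filtering idea (the real TopFilter uses its own graph-based encoding and syntactic similarity function): score candidate topics for a target repository from the topics of its most similar repositories in a project-topic matrix. The matrix and topic names below are invented.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

topics = ["docker", "machine-learning", "android", "cli", "rest-api"]
# Rows: already-labeled projects; last row: the target project with partial labels.
matrix = np.array([
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],   # target project
])
target = matrix[-1]
sims = cosine_similarity(matrix[:-1], target.reshape(1, -1)).ravel()

# Score each missing topic by the similarity-weighted votes of the neighbor projects.
scores = (sims[:, None] * matrix[:-1]).sum(axis=0)
for i in np.argsort(-scores):
    if target[i] == 0:
        print(f"recommend '{topics[i]}' (score {scores[i]:.2f})")
```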

Understanding The Impact of Solver Choice in Model-Based Test Generation

Background: In model-based test generation, SMT solvers explore the state-space of the model in search of violations of specified properties. If the solver finds that a predicate can be violated, it produces a partial test specification demonstrating the violation.

Aims: The choice of solvers is important, as each may produce differing counterexamples. We aim to understand how solver choice impacts the effectiveness of generated test suites at finding faults.

Method: We have performed experiments examining the impact of solver choice across multiple dimensions, examining the ability to attain goal satisfaction and fault detection when satisfaction is achieved---varying the source of test goals, data types of model input, and test oracle.

Results: The results of our experiment show that solvers vary in their ability to produce counterexamples, and---for models where all solvers achieve goal satisfaction---in the resulting fault detection of the generated test suites. The choice of solver has an impact on the resulting test suite, regardless of the oracle, model structure, or source of testing goals.

Conclusions: The results of this study identify factors that impact fault-detection effectiveness and offer advice that could improve future approaches to model-based test generation.

Using Extremely Simplified Functional Size Measures for Effort Estimation: an Empirical Study

Background: Functional size measures of software are widely used for effort estimation, because they provide a fairly objective quantification of software size, which is one of the main factors affecting software development effort. Unfortunately, in some conditions, performing the standard Function Point Analysis process may be too long and expensive. Moreover, functional measures could be needed before functional requirements have been elicited completely and at the required detail level. Aim: Basing effort estimation on measures that are simpler than Function Points---hence, faster and cheaper to obtain---could be beneficial, if good estimation accuracy is possible. In this paper, the level of accuracy that can be obtained using simple measures instead of Function Points---as defined by IFPUG, the International Function Point User Group---is evaluated, based on an empirical study. Method: An analysis of the data provided in the ISBSG dataset was performed. Effort models based on standard IFPUG Function Points and on the number of transactions were derived. The accuracy of the effort estimates delivered by the two models was then compared. Results: Effort models based on the number of transactions appear marginally less accurate than models based on standard IFPUG Function Points for new development projects, and marginally more accurate for projects extending previously developed software. Conclusions: Based on the results of the empirical study, it seems that using the number of transactions instead of the standard IFPUG Function Points measures does not cause estimate accuracy to change to a practically appreciable extent. This conclusion applies to effort models using a size measure as the only independent variable. Additional research is necessary to evaluate the performance of effort models based on the number of transactions in combination with other factors.
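
As a concrete illustration of the kind of models compared (with invented numbers, not ISBSG data), the sketch below fits a log-log effort model once on Function Points and once on transaction counts and compares their mean magnitude of relative error (MMRE).

```python
import numpy as np

def fit_loglog(size, effort):
    """Fit log(effort) = a + b*log(size) and return a prediction function."""
    b, a = np.polyfit(np.log(size), np.log(effort), 1)
    return lambda s: np.exp(a) * s ** b

fp  = np.array([120, 300, 85, 560, 210], dtype=float)      # IFPUG Function Points (toy)
tx  = np.array([40, 95, 30, 180, 70], dtype=float)         # number of transactions (toy)
eff = np.array([900, 2600, 700, 5400, 1800], dtype=float)  # effort in person-hours (toy)

model_fp, model_tx = fit_loglog(fp, eff), fit_loglog(tx, eff)

def mmre(pred, actual):
    return np.mean(np.abs(pred - actual) / actual)

print("MMRE (Function Points):", round(mmre(model_fp(fp), eff), 3))
print("MMRE (transactions):   ", round(mmre(model_tx(tx), eff), 3))
```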

What Industry Wants from Requirements Engineers in China?: An Exploratory and Comparative Study on RE Job Ads

[Background] Publications on the professional occupation of Requirements Engineering (RE) have reported on market demands for both RE and non-RE qualifications and indicated the state of the practice of RE roles in industry. However, prior research was not conducted from the perspective of the RE area in the Software Engineering Body of Knowledge (SWEBOK), nor did it shed light on the industry's needs for RE professionals in China. [Aims] This paper focused on RE-specific tasks and skills sought after in China, from the perspective of the RE activities elaborated in SWEBOK. [Method] Using an empirical qualitative research method, we selected and analyzed 535 job ads from China's two largest job portals. Job titles and descriptions of these ads were analyzed to uncover RE-relevant responsibilities in the categories of SWEBOK RE activities as well as RE skills. [Results] We identified the qualifications, experience and skills demanded by Chinese employers. Specifically, we reported 23 RE tasks demanded in the 535 job ads, from the perspective of SWEBOK RE activities. [Conclusion] Our findings reveal that in China's job market, 'requirements engineer' is explicitly used as a job ad title. Moreover, around 78% of the selected job positions want employees to perform tasks in requirements elicitation. Editing requirements specifications is the most in-demand task among RE activities. In addition, employers placed more emphasis on both RE-specific and broad industry experience.

Why Research on Test-Driven Development is Inconclusive?

[Background] Recent investigations into the effects of Test-Driven Development (TDD) have been contradictory and inconclusive. This hinders development teams from using research results as the basis for deciding whether and how to apply TDD. [Aim] To support researchers when designing a new study and to increase the applicability of TDD research to the decision-making process in an industrial context, we aim at identifying the reasons behind the inconclusive research results on TDD. [Method] We studied the state of the art in TDD research published in top venues in the past decade, and analyzed the way these studies were set up. [Results] We identified five categories of factors that directly impact the outcome of studies on TDD. [Conclusions] This work can help researchers conduct more reliable studies, and inform practitioners of risks they need to consider when consulting research on TDD.

Work Practices and Perceptions from Women Core Developers in OSS Communities

Background. The effect of gender diversity in open source communities has gained increasing attention from practitioners and researchers. For instance, organizations such as the Python Software Foundation and the OpenStack Foundation started actions to increase gender diversity and promote women to top positions in the communities. Problem. Although the general underrepresentation of women (a.k.a. horizontal segregation) in open source communities has been explored in a number of research studies, little is known about the vertical segregation in open source communities---which occurs when there are fewer women in high-level positions. Aims. To address this research gap, in this paper we present the results of a mixed-methods study on gender diversity and work practices of core developers contributing to open-source communities. Method. In the first study, we used mining software repositories procedures to identify the core developers of 711 open source projects, in order to understand how common women core developers are in open source communities and characterize their work practices. In the second study, we surveyed the women core developers we identified in the first study to collect their perceptions of gender diversity and gender bias they might have observed while contributing to open source systems. Results. Our findings show that open source communities present both horizontal and vertical segregation (only 2.3% of the core developers are women). Nevertheless, differently from previous studies, most of the women core developers (65.7%) report never having experienced gender discrimination when contributing to an open source project. Finally, we did not note substantial differences in work practices between women and men core developers. Conclusions. We reflect on these findings and present some ideas that might increase the participation of women in open source communities.

SESSION: Emerging Results and Vision Papers

Using Situational and Narrative Analysis for Investigating the Messiness of Software Security

Background: Software engineering work and its context often have characteristics of what in social science is termed 'messy'; they have ephemeral and irregular qualities. This puts high demands on researchers doing inquiry and analysis. Aims: This paper aims to show what a combination of situational analysis (SA) and narrative analysis (NA) can bring to qualitative software engineering research, and in particular to situations characterised by mess. Method: SA and NA were applied to a case study on software security. Results: We found that these analysis methods helped us gain new insights and understandings and a broader perspective on the situation we were studying. Additionally, the methods helped collaboration in the analysis. Conclusion: We recommend applying and studying these and similar combinations of analysis approaches further.

The Impact of a Proposal for Innovation Measurement in the Software Industry

Background: Measuring an organization's capability to innovate and assessing its innovation output and performance is a challenging task. Previously, a comprehensive model and a suite of measurements to support this task were proposed. Aims: In the current paper, seven years since the publication of the paper titled "Towards innovation measurement in the software industry", we reflect on the impact of that work. Method: We mainly relied on quantitative and qualitative analysis of the citations of the paper using an established classification schema. Results: We found that the article has had a significant scientific impact (indicated by the number of citations), i.e., it is (1) cited in literature from both software engineering and other fields, (2) cited in grey literature and peer-reviewed literature, and (3) substantially cited in literature not published in the English language. However, we consider the majority of the citations in the peer-reviewed literature (75 out of 116) to be neutral, i.e., they have not used the innovation measurement paper in any substantial way. All in all, 38 out of 116 have used, modified or based their work on the definitions, measurements or the model proposed in the article. This analysis also revealed a significant weakness of the citing work: among the citing papers, we found only two explicit comparisons to the innovation measurement proposal, and we found no papers that identify weaknesses of said proposal. Conclusions: This work highlights the need to be cautious about relying solely on the number of citations for understanding impact, and the need for further improving and supporting the peer-review process to identify unwarranted citations in papers.

Bug! Falha! Bachi! Fallo! Défaut! 程序错误!: What about Internationalization Testing in the Software Industry?

Background. Testing is an essential activity in the software development life cycle. Nowadays, testing activities are widely spread along the software development process, since software products are continuously tested to meet users' expectations and to compete in global markets. In this context, internationalization testing is defined as the practice focused on determining that software works properly in a specific language and in a particular region. Aims. This study aims to explore the particularities of internationalization testing in the software industry and discuss the importance of this practice from the point of view of professionals working in this context. Method. We developed an exploratory qualitative study and conducted interviews with professionals from an international software company, in order to understand three aspects of internationalization testing: the general characteristics and importance of this practice, the particularities of the process, and the role of test automation in this context. Results. A total of 13 professionals participated in this study. Results demonstrated that internationalization testing is mostly related to aspects of graphical user interfaces. In this context, truncation and mistranslations are the main faults observed, and test automation might be difficult to implement and maintain due to the amount of validation that is human-dependent. Conclusion. Internationalization testing is an important practice to guarantee the quality of software products developed for global markets. However, this aspect of software testing remains unpopular or unfamiliar among professionals. This study is a step forward in the process of informing and enlightening academic researchers and practitioners in industry about this theme.

How long do Junior Developers take to Remove Technical Debt Items?

Background. Software engineering is one of the engineering fields with the highest inflow of junior engineers. Tools that utilize source code analysis to provide feedback on internal software quality, i.e. Technical Debt (TD), are valuable to junior developers, who can learn and improve their coding skills with minimal consultations with senior colleagues. Objective. We aim at understanding which SonarQube TD items junior developers prioritize during refactoring and how long they take to refactor them. Method. We designed a case study with a replicated design and conducted it with 185 junior developers in two countries, who developed 23 projects with different programming languages and architectures. Results. Junior developers focus homogeneously on different types of TD items. Moreover, they can refactor items in a fraction of the estimated time, never spending more than 50% of the time estimated by SonarQube. Conclusion. Junior developers appreciate the usage of SonarQube and consider it a useful tool. Companies might ask junior developers to quickly clean their code.

Trustworthiness Perceptions in Code Review: An Eye-tracking Study

Background: Automated program repair and other bug-fixing approaches are gaining attention in the software engineering community. Automation shows promise in reducing bug fixing costs. However, many developers express reluctance about accepting machine-generated patches into their codebases.

Aims: To contribute to the scientific understanding and empirical investigation of human trust and perception with regard to automation in software maintenance.

Method: We design and conduct an eye-tracking study investigating how developers perceive trust as a function of code provenance (i.e., author or source). We systematically vary provenance while controlling for patch quality.

Results: In our study of ten participants, overall visual code scanning and the distribution of attention differed across identical code patches labeled as human- vs. machine-written. Participants looked more at the source code for human-labeled patches and looked more at tests for machine-labeled patches. Participants judged human-labeled patches to have better readability and coding style. However, participants were more comfortable giving a critical task to an automated program repair tool.

Conclusion: We find that there are significant differences in code review behavior based on trust as a function of patch provenance. Further, we find that important differences can be revealed by eye tracking. Our results may inform the subsequent design and analysis of automated repair techniques to increase developers' trust and, consequently, their deployment.

Avoiding Plagiarism in Systematic Literature Reviews: An Update Concern

Background: The number of Systematic Literature Reviews (SLRs) published in Software Engineering has been increasing in recent years. Given this, one point not yet addressed is the possibility of plagiarism when dealing with SLRs, especially in SLR maintenance (update) activities.

Aims: To identify what researchers understand by SLR plagiarism, whether the process of maintaining these SLRs can increase the chances of committing plagiarism, and how this misconduct can be avoided.

Method: During a previous survey about SLR maintenance, we asked questions about SLR plagiarism issues, and semi-structured interviews were also performed to obtain responses from experts conducting SLRs on possible SLR maintenance plagiarism issues. In addition, a comparison between SLR updates and their original works was made, through a plagiarism tool, looking for possible plagiarism situations.

Results: The general concern, in both the survey and the semi-structured interviews, is with proper citation of studies and of what was published by others. We understand, therefore, that plagiarism in SLRs is no different from plagiarism in other areas of research. However, there are specific situations in the SLR maintenance process that lead us to present a set of good practices to avoid plagiarism. When comparing SLR updates with their originals, possible plagiarism situations were identified, mainly in the method section, which was expected due to its reuse.

Conclusions: Although we do not need a specific plagiarism definition for SLRs, when we think of SLR maintenance (update), we understand there is a higher possibility of committing plagiarism without proper care, as identified in the comparisons made between SLR updates and their original versions.

Cohort Studies in Software Engineering: A Vision of the Future

Background. Most Mining Software Repositories (MSR) studies cannot obtain causal relations because they are not controlled experiments. The use of cohort studies as defined in epidemiology could help to overcome this shortcoming.

Objective. Propose the adoption of cohort studies in MSR research in particular and empirical Software Engineering (SE) in general. Method. We ran a preliminary literature review to show the current state of the practice of cohort studies in SE. We explore how cohort studies overcome the issues that prevent the identification of causality in this type of non-experimental design.

Results. The basic mechanism used by cohort studies to try to obtain causality consists of controlling potentially confounding variables. This is articulated by means of different techniques.

Conclusion. Cohort studies seem to be a promising approach to be used in MSR in particular and SE in general.

Automatic Identification of Code Smell Discussions on Stack Overflow: A Preliminary Investigation

Background: Code smells indicate potential design or implementation problems that may have a negative impact on programs. As with other software artefacts, developers use Stack Overflow (SO) to ask questions about code smells. However, given the high number of questions asked on the platform and the limitations of the default tagging system, it takes significant effort to extract knowledge about code smells through manual approaches. Aim: We utilized supervised machine learning techniques to automatically identify code-smell discussions from SO posts. Method: We conducted an experiment using a manually labeled dataset containing 3000 code-smell and 3000 non-code-smell posts to evaluate the performance of different classifiers when automatically identifying code smell discussions. Results: Our results show that Logistic Regression (LR) with parameter C=20 (the inverse of the regularization strength) and the Bag of Words (BoW) feature extraction technique achieved the best performance among the algorithms we evaluated, with a precision of 0.978, a recall of 0.965, and an F1-score of 0.971. Conclusion: Our results show that a machine learning approach can effectively locate code-smell posts even when a post's title and/or tags are not helpful. The technique can be used to extract code smell discussions from other textual artefacts (e.g., code reviews) and, promisingly, to extract SO discussions on other topics.
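
For readers unfamiliar with this setup, the following is a minimal sketch (not the authors' implementation) of a Bag-of-Words plus Logistic Regression (C=20) pipeline of the kind the abstract describes, written with scikit-learn; the example posts, labels, and prediction query are placeholders.

```python
# Minimal sketch: Bag-of-Words + Logistic Regression (C=20) for classifying
# Stack Overflow posts as code-smell vs. non-code-smell discussions.
# The posts, labels, and query below are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

posts = [
    "My class has grown huge and does everything, is this a god class?",      # placeholder
    "Long method with many nested loops, should I extract smaller methods?",  # placeholder
    "How do I center a div horizontally with CSS?",                           # placeholder
    "What is the difference between git merge and git rebase?",               # placeholder
]
labels = [1, 1, 0, 0]  # 1 = code-smell discussion, 0 = other

clf = Pipeline([
    ("bow", CountVectorizer()),                       # Bag-of-Words features
    ("lr", LogisticRegression(C=20, max_iter=1000)),  # C = inverse of regularization strength
])
clf.fit(posts, labels)

# Classify a new, unseen post (placeholder query).
print(clf.predict(["Is a method with 15 parameters considered a smell?"]))
```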

Web Frameworks for Desktop Apps: an Exploratory Study

Background: Novel frameworks for the development of desktop applications with web technologies have become popular. These desktop web app frameworks allow developers to reuse existing code and knowledge of web applications for the creation of cross-platform apps integrated with native APIs. Aims: To date, desktop web app frameworks have not been studied empirically. In this paper, we aim to fill this gap by characterizing the usage of web frameworks and providing evidence on how beneficial their pros are and how impactful their cons are. Method: We conducted an empirical study, collecting and analyzing 453 desktop web apps publicly available on GitHub. We performed qualitative and quantitative analyses to uncover the traits and issues of desktop web apps. Results: We found that desktop web app frameworks enable the development of cross-platform applications even for small teams, taking advantage of the abundance of available web libraries. However, at the same time, bugs deriving from platform compatibility issues are common. Conclusions: Our study provides concrete evidence of some disadvantages associated with desktop web app frameworks. Future work is required to assess their impact on the required development and maintenance effort, and to investigate other aspects not considered in this first study.

GASSER: Genetic Algorithm for teSt SuitE Reduction

Background. Regression testing is a practice that ensures a System Under Test (SUT) still works as expected after changes. The simplest regression testing approach is Retest-all, which consists of re-executing the entire Test Suite (TS) on the new version of the SUT. When the SUT and its TS grow in size, applying Retest-all can become expensive. Test Suite Reduction (TSR) approaches aim to overcome this issue by reducing TSs while preserving their fault-detection capability.

Aim. In this paper, we introduce GASSER (Genetic Algorithm for teSt SuitE Reduction), a new approach for TSR based on a multi-objective evolutionary algorithm, namely NSGA-II.

Method. GASSER reduces TSs by maximizing statement coverage and diversity of test cases, and by minimizing the size of the reduced TSs.
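
As an illustration only, the sketch below shows how the three objectives could be evaluated for a candidate reduced suite; it is not the GASSER implementation, and representing each test case by the set of statement IDs it covers is an assumption made for this example.

```python
# Minimal sketch (not GASSER itself) of the three objectives for a candidate
# reduced suite: statement coverage and diversity are maximized, suite size is
# minimized. NSGA-II would evolve the 0/1 selection vector and minimize all
# returned values, so the maximized objectives are negated.

def objectives(selected, coverage):
    """selected: list of 0/1 flags; coverage: list of sets of covered statement IDs."""
    chosen = [cov for keep, cov in zip(selected, coverage) if keep]
    stmt_coverage = len(set().union(*chosen)) if chosen else 0     # maximize
    # diversity: average pairwise Jaccard distance between chosen test cases (maximize)
    dists = [1 - len(a & b) / len(a | b)
             for i, a in enumerate(chosen) for b in chosen[i + 1:] if a | b]
    diversity = sum(dists) / len(dists) if dists else 0.0
    size = len(chosen)                                             # minimize
    return -stmt_coverage, -diversity, size                        # all minimized by NSGA-II

# Toy example: keep tests 1 and 3 out of three tests with known coverage sets.
print(objectives([1, 0, 1], [{1, 2, 3}, {2, 3}, {3, 4, 5}]))       # (-5, -0.8, 2)
```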

Results. The preliminary study shows that GASSER reduces TS size more than traditional approaches, with only a small effect on fault-detection capability.

Conclusions. These outcomes highlight the potential benefits of using multi-objective evolutionary algorithms in the TSR field and lay the basis for future work.

Beyond Accuracy: ROI-driven Data Analytics of Empirical Data

Background: Unprecedented access to data has created a remarkable opportunity to analyze, understand, and optimize investigation approaches in almost all areas of (Empirical) Software Engineering. However, data analytics is time- and effort-consuming, and thus expensive, and not automatically valuable.

Objective: This vision paper demonstrates that it is crucial to consider Return-on-Investment (ROI) when performing Data Analytics. Decisions on "How much analytics is needed?" are hard to make. ROI can provide decision support on the What?, How?, and How Much? of analytics for a given problem.

Method: The proposed conceptual framework is validated through two empirical studies that focus on requirements dependencies extraction in the Mozilla Firefox project. The two case studies are (i) the evaluation of fine-tuned BERT against Naive Bayes and Random Forest machine learners for binary dependency classification, and (ii) Active Learning against passive learning (random sampling) for REQUIRES dependency extraction. For both cases, the analysis investment (cost) is estimated and the achievable benefit from Data Analytics (DA) is predicted, to determine a break-even point for the investigation.
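
As a rough illustration of the second case study's setup (not the paper's implementation), the sketch below contrasts uncertainty-sampling Active Learning using a Random Forest classifier with the passive, random-sampling alternative; the features, labels, pool sizes, and number of iterations are all invented.

```python
# Minimal sketch: uncertainty-sampling Active Learning with a Random Forest,
# where the model repeatedly queries the instances it is least certain about.
# All data below is synthetic; real features/labels would come from requirement pairs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                  # placeholder features for requirement pairs
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # placeholder REQUIRES / no-dependency labels

# Small initial labeled pool with both classes present; the rest is the unlabeled pool.
labeled = list(np.where(y == 0)[0][:10]) + list(np.where(y == 1)[0][:10])
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):                             # Active Learning iterations (assumed budget)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    # query the 10 pool instances the model is least certain about (probability closest to 0.5)
    query = np.argsort(np.abs(proba - 0.5))[:10]
    for idx in sorted(query, reverse=True):     # pop in reverse order to keep indices valid
        labeled.append(pool.pop(idx))

print(f"labeled after Active Learning: {len(labeled)} of {len(X)} instances")
# The passive baseline would instead draw the same number of instances uniformly at random.
```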

Results: For the first study, fine-tuned BERT performed better than Random Forest, provided that more than 40% of the training data was available. For the second, Active Learning achieved higher F1 accuracy within fewer iterations and a higher ROI than the baseline (a Random Forest classifier trained with random sampling). In both studies, the break-even point indicated how much analysis would likely pay off for the invested effort.
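
To make the break-even idea concrete, the following toy calculation (with invented cost and benefit figures, not numbers from the paper) shows how cumulative ROI can signal when further analysis stops paying off.

```python
# Toy break-even calculation: ROI = (cumulative benefit - cumulative cost) / cumulative cost.
# The break-even point is the first iteration where cumulative benefit covers cumulative cost.
# All numbers below are assumed for illustration only.
cost_per_iteration = [10, 10, 10, 10, 10, 10]    # e.g., person-hours per analysis iteration (assumed)
benefit_per_iteration = [2, 8, 18, 14, 6, 2]     # predicted value gained per iteration (assumed)

cum_cost = cum_benefit = 0
break_even = None
for i, (c, b) in enumerate(zip(cost_per_iteration, benefit_per_iteration), start=1):
    cum_cost += c
    cum_benefit += b
    roi = (cum_benefit - cum_cost) / cum_cost
    print(f"iteration {i}: cumulative ROI = {roi:+.2f}")
    if break_even is None and cum_benefit >= cum_cost:
        break_even = i

print(f"break-even reached at iteration {break_even}")
# Beyond this point each extra iteration adds more cost than benefit in this toy example.
```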

Conclusions: Decisions on the depth and breadth of DA of empirical data should not be based solely on accuracy measures. ROI-driven Data Analytics provides a simple yet effective way to decide when to stop further investigation by weighing the cost and value of the various types of analysis, and thus helps to avoid over-analyzing empirical data.

Profiling Developers Through the Lens of Technical Debt

Context: Technical Debt needs to be managed to avoid disastrous consequences, and investigating developers' habits concerning technical debt management provides invaluable information for software development. Objective: This study aims to characterize how developers manage technical debt based on the code smells they induce and the refactorings they apply. Method: We mined a publicly-available Technical Debt dataset for Git commit information, code smells, coding violations, and refactoring activities for each developer of a selected project. Results: By combining this information, we profile developers to recognize prolific coders, highlight activities that discriminate among developer roles (reviewer, lead, architect), and estimate coding maturity and technical debt tolerance.

On the use of C# Unsafe Code Context: An Empirical Study of Stack Overflow

Background. C# maintains type safety and security by disallowing dangerous direct pointer arithmetic. To improve performance in special cases, pointer arithmetic is provided via an unsafe context. Programmers can use the C# unsafe keyword to encapsulate a code block, which can then use pointer arithmetic. In the Common Language Runtime (CLR), unsafe code is referred to as unverifiable code. It then becomes the responsibility of the programmer to ensure the encapsulated code snippet is not dangerous. Naturally, this raises the concern of whether such trust is misused by programmers when they promote the use of the C# unsafe context. Aim. We aim to analyze the prevalence and vulnerabilities of shared code examples that use the C# unsafe keyword on the Stack Overflow (SO) code-sharing platform. Method. Using regular expressions and manual checks, we extracted posts relevant to C# unsafe code from SO and categorized them into software development scenarios. Results. In the entire SO data dump of September 2018, we find 2,283 C# snippets with the unsafe keyword. Among those posts, 27% are about image processing, where unsafe code is mainly used for performance reasons. The second most popular category, with 21% of the code in the posts, is 'Interoperability': 'unsafe' is used to enable interoperability between C# managed code and unmanaged code. The third category, with 9% of unsafe code posts, concerns the 'stackalloc' operator, which allocates a block of memory on the stack; since C# 7.2, Microsoft has recommended avoiding 'stackalloc' in an unsafe context whenever possible. Manual inspection shows 67 code snippets with dangerous functions that can introduce vulnerabilities (e.g., buffer overflow) if not used with caution. Finally, in 35% of 'Interoperability' posts with the 'P/Invoke' tag, P/Invoke was used outside a NativeMethods class, which contradicts Microsoft's design suggestion. Conclusion. Our study leads to 7 main findings, which show the importance of using this feature cautiously.

Java Cryptography Uses in the Wild

[Background] Previous research has shown that developers commonly misuse cryptography APIs. [Aim] We have conducted an exploratory study to find out how crypto APIs are used in open-source Java projects, what types of misuses exist, and why developers make such mistakes. [Method] We used a static analysis tool to analyze hundreds of open-source Java projects that rely on the Java Cryptography Architecture, and manually inspected half of the analysis results to assess the tool's findings. We also contacted the maintainers of these projects by creating an issue on the GitHub repository of each project, and discussed the misuses with developers. [Results] We learned that 85% of cryptography APIs are misused; however, not every misuse has severe consequences. Developer feedback showed that security caveats in the documentation of crypto APIs are rare, that developers may overlook misuses that originate in third-party code, and that the context in which a crypto API is used should be taken into account. [Conclusion] We conclude that using crypto APIs is still problematic for developers, but blindly blaming them for such misuses may lead to erroneous conclusions.

SESSION: Industry Papers

What Makes Agile Test Artifacts Useful?: An Activity-Based Quality Model from a Practitioners' Perspective

Background: The artifacts used in Agile software testing and the reasons why these artifacts are used are fairly well-understood. However, empirical research on how Agile test artifacts are eventually designed in practice and which quality factors make them useful for software testing remains sparse. Aims: Our objective is two-fold. First, we identify current challenges in using test artifacts to understand why certain quality factors are considered good or bad. Second, we build an Activity-Based Artifact Quality Model that describes what Agile test artifacts should look like. Method: We conducted an industrial survey with 18 practitioners from 12 companies operating in seven different domains. Results: Our analysis reveals nine challenges and 16 factors describing the quality of six test artifacts from the perspective of Agile testers. Interestingly, we observed mostly challenges regarding language and traceability, which are well-known to occur in non-Agile projects. Conclusions: Although Agile software testing is becoming the norm, we still have little confidence about general do's and don'ts beyond conventional wisdom. This study is the first to distill a list of quality factors deemed important for useful test artifacts.

A Taste of the Software Industry Perception of Technical Debt and its Management in Uruguay: A survey in software industry

Background: Technical debt (TD) has been an important focus of attention for the scientific community and the software industry in recent years. TD is a concept for expressing the lack of internal software quality that directly affects the software's capacity to evolve. Some studies have focused on the industry perspective on TD. Aims: To characterize how software industry professionals in Uruguay understand, perceive, and adopt technical debt management (TDM) activities. Method: To replicate a Brazilian survey with the Uruguayan software industry and compare the findings. Results: Of the 259 respondents, many indicated limited awareness of the TD concept, owing to the difficulty of associating such a concept with actual software issues. Accordingly, there is considerable variability in the importance attributed to TDM among the respondents, and only a small portion of them declared that they carry out TDM activities in their organizations. A list of software technologies that practitioners reported using was produced and can be useful to support TDM activities. Conclusions: The TD concept and its management are not yet common in Uruguay. There are indications of TD unawareness and of difficulties in conducting some TDM activities considered very important by the practitioners. More effort is needed to disseminate TD knowledge and to provide software technologies that support the adoption of TDM in Uruguay. Other software engineering communities likely face similar issues; therefore, further investigations in these communities may be of interest.

Getting Started with Chaos Engineering - design of an implementation framework in practice

Background. Chaos Engineering is proposed as a practice to verify a system's resilience under real, operational conditions. It employs fault injection, was originally developed at Netflix, and is supported by several tools from there and from other sources. Aims. We aim to introduce Chaos Engineering at ICA Gruppen AB, a group of companies whose core business is grocery retail, to improve their systems' resilience, and to capture the knowledge gained from the literature and interviews in a process framework for the introduction of Chaos Engineering. Method. The research is conducted under the design science paradigm, where the problem is conceptualized through a literature study of Chaos Engineering and exploratory interviews in the company. The solution framework is designed based on the literature and a tool survey, and validated by letting software engineers at ICA apply parts of it to the software systems of the ica.se website, including its e-shop. Results. The main contributions are a synthesis of the Chaos Engineering literature and tools, an in-depth understanding of the needs of the case company, and guidelines for introducing Chaos Engineering. Conclusions. The applied parts proved feasible and successfully uncovered an initial set of improvement opportunities for the system's resilience, as well as a suitable Chaos Engineering practice for future resilience testing of the system. We recommend that companies use the framework as a guide for implementing Chaos Engineering.