MaLTESQuE 2021: Proceedings of the 5th International Workshop on Machine Learning Techniques for Software Quality Evolution

SESSION: Papers

Comparing within- and cross-project machine learning algorithms for code smell detection

Code smells are a well-known problem in software engineering, since they are a notorious cause of reduced comprehensibility and maintainability. The most recent efforts to devise automatic machine learning-based code smell detection techniques have so far achieved unsatisfactory results. This could be explained by the fact that all these approaches follow a within-project classification, i.e., training and test data are taken from the same source project, which, combined with the imbalanced nature of the problem, produces datasets with a very low number of instances belonging to the minority class (i.e., smelly instances). In this paper, we propose a cross-project machine learning approach and compare its performance with a within-project alternative. The core idea is to use transfer learning to increase the overall number of smelly instances in the training datasets. Our results show that cross-project classification performs very similarly to within-project classification. Although this finding does not yet provide a step forward in increasing the performance of ML techniques for code smell detection, it sets the basis for further investigations.
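The difference between the two evaluation setups can be illustrated with a minimal sketch. It only shows how the training and test sets are partitioned, not the authors' transfer learning pipeline; the project names, feature values, and function names are hypothetical.

```python
# Hypothetical toy records: (project, feature_vector, is_smelly).
SAMPLES = [
    ("proj_a", [120, 14], True),
    ("proj_a", [30, 2], False),
    ("proj_a", [45, 3], False),
    ("proj_b", [200, 25], True),
    ("proj_b", [25, 1], False),
    ("proj_c", [180, 19], True),
    ("proj_c", [40, 4], False),
    ("proj_c", [35, 2], False),
]


def within_project_split(samples, project, test_every=2):
    """Train and test on instances taken from the same project."""
    own = [s for s in samples if s[0] == project]
    # Every `test_every`-th instance goes to the test set, the rest to training.
    test = [s for i, s in enumerate(own) if i % test_every == 0]
    train = [s for i, s in enumerate(own) if i % test_every != 0]
    return train, test


def cross_project_split(samples, target_project):
    """Train on all other projects, test on the target project."""
    train = [s for s in samples if s[0] != target_project]
    test = [s for s in samples if s[0] == target_project]
    return train, test


def count_smelly(samples):
    """Number of minority-class (smelly) instances in a sample list."""
    return sum(1 for _, _, smelly in samples if smelly)


within_train, within_test = within_project_split(SAMPLES, "proj_a")
cross_train, cross_test = cross_project_split(SAMPLES, "proj_a")
```

In this toy data, the cross-project training set contains more smelly instances than the within-project one, which is the effect the abstract aims for.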

Unsupervised learning of general-purpose embeddings for code changes

Applying machine learning to tasks that operate with code changes requires their numerical representation. In this work, we propose an approach for obtaining such representations during pre-training and evaluate them on two different downstream tasks: applying changes to code and commit message generation. During pre-training, the model learns to apply the given code change correctly. This task requires only code changes themselves, which makes it unsupervised. In the task of applying code changes, our model outperforms baseline models by 5.9 percentage points in accuracy. As for commit message generation, our model achieved the same results as supervised models trained for this specific task, which indicates that it can encode code changes well and can be improved in the future by pre-training on a larger dataset of easily gathered code changes.

VaryMinions: leveraging RNNs to identify variants in event logs

Business processes have to manage variability in their execution, e.g., to deliver the correct building permit in different municipalities. This variability is visible in event logs, where sequences of events are shared by the core process (building permit authorisation) but may also be specific to each municipality. To rationalise resources (e.g., derive a configurable business process capturing all municipalities’ permit variants) or to debug anomalous behaviour, it is necessary to identify the variant to which a given trace belongs. This paper supports this task by training Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks on two datasets: a configurable municipality and a travel expenses workflow. We demonstrate that variability can be identified accurately (>87%) and discuss the challenges of learning highly entangled variants.

Toward static test flakiness prediction: a feasibility study

Flaky tests are tests that exhibit both passing and failing behavior when run against the same code. While the research community has attempted to define automated approaches for detecting and addressing test flakiness, most of them suffer from scalability issues and uncertainty, as they require test cases to be run multiple times. This limitation has recently been targeted by machine learning solutions that predict the flakiness of tests using a set of both static and dynamic metrics, avoiding the re-execution of tests. Recognizing the effort spent so far, this paper takes the first steps toward an orthogonal view of the problem, namely the classification of flaky tests using only statically computable software metrics. We propose a feasibility study on 72 projects of the iDFlakies dataset, and investigate the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells. First, we statistically assess those differences. Second, we build a logistic regression model to verify the extent to which the differences observed are still significant when the metrics are considered together. The results show a relation between test flakiness and a number of test and production code factors, indicating the possibility of building classification approaches that exploit those factors to predict test flakiness.
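As a rough illustration of the second step, the sketch below fits a logistic regression on static metrics by plain gradient descent. It is not the authors' model: the two features, their values, and the labels are hypothetical, and the sketch does not compute the statistical significance of the coefficients.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one instance.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that a test with metrics x is flaky."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical static metrics per test: [assertion_density, uses_async_wait].
X = [[0.2, 1], [0.3, 1], [0.25, 1], [0.3, 0], [0.2, 0], [0.25, 0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = flaky, 0 = non-flaky (made-up labels)

weights, bias = fit_logistic(X, y)
```

Considering the metrics together in one model, rather than one at a time, is what lets the coefficients reflect each metric's contribution while the others are held fixed.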

Building a bot for automatic expert retrieval on discord

It is common for software practitioners to look for experts on online chat platforms, such as Discord. However, finding them is a complex activity that requires a deep knowledge of the open source community. As a consequence, newcomers and casual participants may not be able to adequately find experts willing to discuss a particular topic.

Our paper describes a bot that provides a ranked list of Discord users who are experts in a particular set of topics. Our bot uses simple heuristics to model expertise, such as a word occurrence table and word embeddings. Our results show that at least half of the retrieved users are indeed experts.
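A word occurrence table of the kind mentioned above could look like the following minimal sketch. Using raw word counts as the expertise score is an assumption made for illustration, the messages and user names are invented, and the word embedding heuristic is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical chat messages: (user, text).
MESSAGES = [
    ("alice", "the borrow checker rejects this lifetime"),
    ("alice", "use a lifetime annotation on the struct"),
    ("bob", "docker compose fails on my machine"),
    ("bob", "what is a lifetime"),
    ("carol", "docker image layers and docker cache"),
]


def build_occurrence_table(messages):
    """Map each user to a Counter of the words they have written."""
    table = defaultdict(Counter)
    for user, text in messages:
        table[user].update(text.lower().split())
    return table


def rank_experts(table, topics, top_k=3):
    """Rank users by how often they used any of the topic words."""
    scores = {
        user: sum(counts[topic] for topic in topics)
        for user, counts in table.items()
    }
    # Highest score first; ties are broken alphabetically by user name.
    ranked = sorted(
        ((score, user) for user, score in scores.items() if score > 0),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [user for _, user in ranked[:top_k]]


table = build_occurrence_table(MESSAGES)
lifetime_experts = rank_experts(table, ["lifetime"])
docker_experts = rank_experts(table, ["docker"])
```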

Metrics selection for load monitoring of service-oriented system

Background. Complex software systems produce a large amount of data depicting their internal state and activities. The data can be monitored to estimate and predict the status of the system, helping to take preventive actions in case of impending malfunctions and failures. However, a complex system may expose thousands of internal metrics, which makes it a non-trivial task to decide which metrics are the most important to monitor.

Objective. In this work, we aim to find a subset of metrics to collect and analyse for monitoring the load of a service-oriented system.

Method. We use a performance test bench tool to generate load of different intensities on the target system, which is a specific service-oriented application platform. The numeric metrics data collected from the system is combined with the load intensity at each moment. The combined data is used to analyse which metrics are best at estimating the load of the system. Using regression analysis, we were able to rank the metrics by their ability to measure the load of the system.

Results. The results show that (1) a machine learning regressor makes it possible to correctly measure the load of a service-oriented system, and (2) the most important metrics are related to network traffic and request counts, as well as memory usage and disk activity.

Conclusion. The results help with the design of efficient monitoring tools. In addition, further investigation should focus on exploring more precise machine learning models to further improve the metric selection process.
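The ranking step described under Method can be sketched as follows. The abstract does not name the regression model used, so this sketch assumes a simple stand-in: each metric is scored by the R² of a univariate least-squares fit against the load intensity. The metric names and values are hypothetical.

```python
from statistics import mean


def r_squared(xs, ys):
    """R^2 of a least-squares line fitted to (xs, ys).

    For a univariate linear fit this equals the squared Pearson correlation.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # A constant series carries no information about the other one.
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def rank_metrics(metrics, load):
    """Sort metrics by how well each one alone explains the load."""
    scores = {name: r_squared(values, load) for name, values in metrics.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical load intensities and metric samples taken at the same moments.
LOAD = [10, 20, 30, 40, 50]
METRICS = {
    "open_files": [7, 3, 8, 2, 6],
    "requests_per_s": [101, 198, 305, 399, 502],
    "cpu_count": [8, 8, 8, 8, 8],
    "net_bytes_out": [12, 19, 33, 38, 52],
}

ranking = rank_metrics(METRICS, LOAD)
```

Scoring each metric in isolation ignores interactions between metrics; a multivariate regressor with feature importances would be one way to account for them.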