SBST '18: Proceedings of the 11th International Workshop on Search-Based Software Testing

SESSION: Keynote I

Predictive analytics for software testing: keynote paper

This keynote discusses the use of Predictive Analytics for Software Engineering, and in particular for Software Defect Prediction and Software Testing, by presenting the latest results achieved in these fields leveraging Artificial Intelligence, Search-based and Machine Learning methods, and by giving some directions for future work.

SESSION: Technical session I

Multifaceted test suite generation using primary and supporting fitness functions

Dozens of criteria have been proposed to judge testing adequacy. Such criteria are important, as they guide automated generation efforts. Yet, the current use of such criteria in automated generation contrasts with how such criteria are used by humans. For a human, coverage is part of a multifaceted combination of testing strategies. In automated generation, coverage is typically the goal, and a single fitness function is applied at one time. We propose that the key to improving the fault detection efficacy of search-based test generation approaches lies in a targeted, multifaceted approach pairing primary fitness functions that effectively explore the structure of the class under test with lightweight supporting fitness functions that target particular scenarios likely to trigger an observable failure.

This report summarizes our findings to date, details the hypothesis of primary and supporting fitness functions, and identifies outstanding research challenges related to multifaceted test suite generation. We hope to inspire new advances in search-based test generation that could benefit our software-powered society.

Search-based optimization for the testing resource allocation problem: research trends and opportunities

This paper explores the usage of search-based techniques for the Testing Resource Allocation Problem (TRAP). We focus on the analysis of the literature, surveying the research proposals where search-based techniques are exploited for different formulations of the TRAP. Three dimensions are considered: the model formulation, solution, and validation. The analysis yields several observations and outlines new research directions towards better modelling and solutions (namely, ones closer to real-world settings), highlighting the most promising areas of investigation.

An effective approach for regression test case selection using Pareto-based multi-objective harmony search

Regression testing is a way of catching bugs in new builds and releases to avoid product risks. Corrective, progressive, retest-all and selective regression testing are the strategies for performing it. Retesting all existing test cases is one of the most reliable approaches, but it is costly in terms of time and effort. This limitation creates scope to reduce regression testing cost by selecting only a subset of test cases that can detect faults in optimal time and effort. This paper proposes a Pareto-based Multi-Objective Harmony Search approach for regression test case selection from an existing test suite to achieve some test adequacy criteria. Fault coverage, unique faults covered and algorithm execution time are used as performance measures for the optimization criteria. The proposed approach is evaluated against Bat Search and Cuckoo Search optimization. The results of statistical tests indicate significant improvement over existing approaches.
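
The Pareto-based selection idea in this abstract can be illustrated with a minimal sketch. It is not the paper's Harmony Search algorithm: it enumerates every subset of a tiny, hypothetical test suite and keeps the non-dominated ones, using two assumed objectives (faults covered, to maximise, and total run time, to minimise). All test names, fault sets, and costs are invented for illustration.

```python
from itertools import combinations

# Hypothetical data: faults detected by each test and its run time in seconds.
FAULTS = {"t1": {"f1", "f2"}, "t2": {"f2", "f3"}, "t3": {"f3"}, "t4": {"f1", "f2", "f3"}}
COST = {"t1": 2.0, "t2": 3.0, "t3": 1.0, "t4": 7.0}

def objectives(subset):
    """Return (number of faults covered, total run time) for a subset of tests."""
    covered = set().union(*(FAULTS[t] for t in subset))
    return len(covered), sum(COST[t] for t in subset)

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_front(candidates):
    """Keep only the candidate subsets that no other candidate dominates."""
    scored = [(s, objectives(s)) for s in candidates]
    return [(s, o) for s, o in scored if not any(dominates(o2, o) for _, o2 in scored)]

# Exhaustive enumeration is feasible only for toy suites; a metaheuristic
# would sample this space instead of listing it.
candidates = [frozenset(c)
              for r in range(1, len(FAULTS) + 1)
              for c in combinations(sorted(FAULTS), r)]
front = pareto_front(candidates)

for subset, (n_faults, cost) in sorted(front, key=lambda item: item[1]):
    print(sorted(subset), n_faults, cost)
```

Each subset on the front is a different trade-off between fault coverage and cost; choosing among them is left to the tester.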


Evaluating search-based techniques with statistical tests

This tutorial covers the basics of how to use statistical tests to evaluate and compare search algorithms, in particular when applied to software engineering problems. Search algorithms like Hill Climbing and Genetic Algorithms are randomised. Running such randomised algorithms twice on the same problem can give different results. It is hence important to run such algorithms multiple times to collect average results, and so avoid publishing wrong conclusions that were based on luck alone. However, there is the question of how often such runs should be repeated. Given a set of n repeated experiments, is n large enough to draw sound conclusions? Or should more experiments have been run? Statistical tests like the Wilcoxon-Mann-Whitney U-test can be used to answer these important questions.
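
As a rough illustration of the kind of comparison the tutorial describes, the sketch below computes the Mann-Whitney U statistic for two sets of repeated runs and a two-sided p-value from the normal approximation. The coverage figures are hypothetical, the function names are ours, and the approximation omits tie and continuity corrections, so a statistics library should be preferred for real analyses.

```python
import math

def u_statistic(xs, ys):
    """U for xs: the number of pairs (x, y) with x > y, counting ties as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

def p_value_normal_approx(xs, ys):
    """Two-sided p-value from the normal approximation of U.

    Simplification: no tie correction and no continuity correction.
    """
    m, n = len(xs), len(ys)
    mean = m * n / 2.0
    sd = math.sqrt(m * n * (m + n + 1) / 12.0)
    z = (u_statistic(xs, ys) - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical branch coverage achieved in 10 independent runs of two algorithms.
runs_a = [0.81, 0.79, 0.85, 0.80, 0.83, 0.82, 0.84, 0.78, 0.86, 0.80]
runs_b = [0.74, 0.77, 0.73, 0.79, 0.75, 0.76, 0.72, 0.78, 0.74, 0.75]

u = u_statistic(runs_a, runs_b)
p = p_value_normal_approx(runs_a, runs_b)
print(f"U = {u}, p = {p:.5f}")
```

A small p-value suggests the difference between the two samples is unlikely to be due to chance alone; with few runs, a large p-value may simply mean that n was too small.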

SESSION: Tool competition

Java unit testing tool competition: sixth round

We report on the advances in this sixth edition of the JUnit tool competitions. This year the contest introduces new benchmarks to assess the performance of JUnit testing tools on different types of real-world software projects. Building on the statistical analyses from past contests, we have extended the evaluation with the combined performance of the tools, aiming to beat the human-made tests. Overall, the 6th competition evaluates four automated JUnit testing tools, taking as a baseline the human-written test cases for the selected benchmark projects. The paper details the modifications made to the methodology and provides full results of the competition.

T3 @SBST2018 benchmark, and how much we can get from asemantical testing

This paper discusses the performance of the automated testing tool for Java called T3 and compares it with a few other tools and human-written tests in a benchmark set by the Java Unit Testing Tool Contest 2018. Since all the compared tools rely on randomization when generating their test data, albeit to different degrees and with different heuristics, this paper also tries to give some insight into just how far we can go without having to reconstruct the precise semantics of the programs under test in order to test them.

EvoSuite at the SBST 2018 tool competition

EvoSuite is a search-based tool that automatically generates executable unit tests for Java code (JUnit tests). This paper summarises the results and experiences of EvoSuite's participation at the sixth unit testing competition at SBST 2018, where EvoSuite achieved the highest overall score (687 points) for the fifth time in six editions of the competition.


Testing and continuous integration at scale: limits, costs, and expectations

Build and verification systems and processes have changed significantly in the past decade, and so have the corresponding development processes. In fact, build and verification systems set a lower bound on how fast a company can ship new features: the more time spent compiling and testing code, the worse the company's ability to compete in the market. In other words, verification is key but needs to be affordable, in terms of both money and time. This keynote will be about some fundamental concepts of modern industry software testing, their costs and limits, as well as the expectations that software developers and teams have of build, test, and development processes.

SESSION: Technical session II

From operational to declarative specifications using a genetic algorithm

In specification-based test generation, sometimes having a formal specification is not sufficient, since the specification may be in a different formalism from that required by the generation approach being used. In this paper, we deal with this problem specifically in the context in which, while having a formal specification in the form of an operational invariant written in a sequential programming language, one needs, for test generation, a declarative invariant in a logical formalism. We propose a genetic algorithm that, given a catalog of common properties of invariants, such as acyclicity, sortedness and balance, attempts to evolve a conjunction of these that most accurately approximates an original operational specification. We present some details of the algorithm, and an experimental evaluation based on a benchmark of data structures, for which we evolve declarative logical invariants from operational ones.

To call, or not to call: contrasting direct and indirect branch coverage in test generation

While adequacy criteria offer an end-point for testing, they do not mandate how targets are covered. Branch Coverage may be attained through direct calls to methods, or through indirect calls between methods. Automated generation is biased towards the rapid gains offered by indirect coverage. Therefore, even with the same end-goal, humans and automation produce very different tests. Direct coverage may yield tests that are more understandable, and that detect faults missed by traditional approaches. However, the added burden for the generation framework may result in lower coverage, and faults that emerge through method interactions may be missed.

To compare the two approaches, we have generated test suites for both, judging efficacy against real faults. We have found that requiring direct coverage results in lower achieved coverage and likelihood of fault detection. However, both forms of Branch Coverage cover code and detect faults that the other does not. By isolating methods, Direct Branch Coverage is less constrained in the choice of input. However, traditional Branch Coverage is able to leverage method interactions to discover faults. Ultimately, both are situationally applicable within the context of a broader testing strategy.

Generating test input with deep reinforcement learning

Test data generation is a tedious and laborious process. Search-based Software Testing (SBST) automatically generates test data optimising structural test criteria using metaheuristic algorithms. In essence, metaheuristic algorithms are systematic trial-and-error procedures based on the feedback of a fitness function. This is similar to a reinforcement learning agent, which iteratively decides an action based on the current state to maximise the cumulative reward. Inspired by this analogy, this paper investigates the feasibility of employing reinforcement learning in SBST to replace human-designed metaheuristic algorithms. We reformulate the software under test (SUT) as an environment of reinforcement learning. At the same time, we present GunPowder, a novel framework for SBST which extends the SUT to the environment. We train a Double Deep Q-Networks (DDQN) agent with a deep neural network and evaluate the effectiveness of our approach by conducting a small empirical study. Finally, we find that agents can learn metaheuristic algorithms for SBST, achieving 100% branch coverage for training functions. Our study sheds light on the future integration of deep neural networks and SBST.

SESSION: Technical session III

An empirical analysis of the mutation operator for run-time adaptive testing in self-adaptive systems

A self-adaptive system (SAS) can reconfigure at run time in response to uncertainty and/or adversity to continually deliver an acceptable level of service. An SAS can experience uncertainty during execution in terms of environmental conditions for which it was not explicitly designed as well as unanticipated combinations of system parameters that result from a self-reconfiguration or misunderstood requirements. Run-time testing provides assurance that an SAS continually behaves as it was designed even as the system reconfigures and the environment changes. Moreover, introducing adaptive capabilities via lightweight evolutionary algorithms into a run-time testing framework can enable an SAS to effectively update its test cases in response to uncertainty alongside the SAS's adaptation engine while still maintaining assurance that requirements are being satisfied. However, the evolutionary parameters that configure the search process for run-time testing may have a significant impact on test results. Therefore, this paper provides an empirical study that focuses on the mutation parameter that guides online evolution as applied to a run-time testing framework, in the context of an SAS.

On the effect of object redundancy elimination in randomly testing collection classes

In this paper, we analyze the effect of reducing object redundancy in random testing, by comparing the Randoop random testing tool with a version of the tool that disregards tests that only produce objects that have been previously generated by other tests. As a side effect, this variant also identifies methods in the software under test that never participate in state changes, and uses these more heavily when building assertions.

Our evaluation of this strategy concentrates on collection classes, since in this context of object-oriented implementations that describe stateful objects obeying complex invariants, object variability is highly relevant. Our experimental comparison takes the main data structures in java.util, and shows that our object redundancy reduction strategy has an important impact on testing collections, measured in terms of code coverage and mutation killing.
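
The redundancy-elimination idea can be sketched as follows. This is a toy model and not Randoop's implementation: the collection, its operations (`OPS`), and the state representation are all assumptions made for illustration. A randomly generated test is kept only if it produces at least one object state that no earlier test has produced.

```python
import random

# Hypothetical operations on a list-backed collection.
OPS = ("add", "remove_first", "clear", "size")

def run(test):
    """Execute a sequence of (operation, argument) pairs and return every state produced."""
    items, states = [], []
    for op, arg in test:
        if op == "add":
            items.append(arg)
        elif op == "remove_first" and items:
            items.pop(0)
        elif op == "clear":
            items.clear()
        # "size" is an observer: it never changes the object's state.
        states.append(tuple(items))
    return states

def generate(n_tests, max_len, seed=0):
    """Generate random tests, discarding those that only rebuild already-seen objects."""
    rng = random.Random(seed)
    seen, kept = set(), []
    for _ in range(n_tests):
        length = rng.randint(1, max_len)
        test = [(rng.choice(OPS), rng.randint(0, 2)) for _ in range(length)]
        new_states = set(run(test)) - seen
        if new_states:  # keep the test only if it builds an object not seen before
            kept.append(test)
            seen |= new_states
    return kept, seen

kept, seen = generate(n_tests=200, max_len=4)
print(f"kept {len(kept)} of 200 tests covering {len(seen)} distinct states")
```

Because the state space in this toy model is small, most late tests are discarded, which is the effect the strategy relies on to spend the testing budget on tests that add object variability.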