ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Full Citation in the ACM Digital Library

SESSION: Invited Presentation

Automated debugging: past, present, and future (ISSTA impact paper award)

Chris Parnin
Alessandro Orso

The paper titled “Are Automated Debugging Techniques Actually Helping Programmers?” was published in the proceedings of the International Symposium on Software Testing and Analysis (ISSTA) in 2011, and has been selected to receive the ISSTA 2021 Impact Paper Award. The paper investigated, through two user studies, how developers used and benefited from popular automated debugging techniques. The results of the studies provided (1) evidence that several assumptions made by automated debugging techniques did not hold in practice and (2) insights on limitations of existing approaches and how these limitations could be addressed. In this talk, we revisit the original paper and the work that led to it. We then assess the impact of that research by reviewing how the area of automated debugging has evolved since the paper was published. Finally, we conclude the talk by reflecting on the current state of the art in this area and discussing open issues and potential directions for future work.

SESSION: Web Applications

Identifying privacy weaknesses from multi-party trigger-action integration platforms

Kulani Mahadewa
Yanjun Zhang
Guangdong Bai
Lei Bu
Zhiqiang Zuo
Dileepa Fernando
Zhenkai Liang
Jin Song Dong

With many trigger-action platforms that integrate Internet of Things (IoT) systems and online services, rich functionalities transparently connecting digital and physical worlds become easily accessible for the end users. On the other hand, such facilities incorporate multiple parties whose data control policies may radically differ and even contradict each other, and thus privacy violations may arise throughout the lifecycle (e.g., generation and transmission) of triggers and actions. In this work, we conduct an in-depth study on the privacy issues in multi-party trigger-action integration platforms (TAIPs). We first characterize privacy violations that may arise with the integration of heterogeneous systems and services. Based on this knowledge, we propose Taifu, a dynamic testing approach to identify privacy weaknesses from the TAIP. The key insight of Taifu is that the applets which actually program the trigger-action rules can be used as test cases to explore the behavior of the TAIP. We evaluate the effectiveness of our approach by applying it on the TAIPs that are built around the IFTTT platform. To our great surprise, we find that privacy violations are prevalent among them. Using the automatically generated 407 applets, each from a different TAIP, Taifu detects 194 cases with access policy breaches, 218 access control missing, 90 access revocation missing, 15 unintended flows, and 73 over-privilege access.

WebEvo: taming web application evolution via detecting semantic structure changes

Fei Shao
Rui Xu
Wasif Haque
Jingwei Xu
Ying Zhang
Wei Yang
Yanfang Ye
Xusheng Xiao

The development of Web technology and the beginning of the Big Data era have led to the development of technologies for extracting data from websites, such as information retrieval (IR) and robotic process automation (RPA) tools. As websites are constantly evolving, to prevent these tools from functioning improperly due to website evolution, it is important to monitor the changes in websites and report them to the developers and testers. Existing monitoring tools mainly use DOM-tree based techniques to detect changes in the new web pages. However, these monitoring tools incorrectly report content-based changes (i.e., web content refreshed every time a web page is retrieved) as the changes that will adversely affect the performance of the IR and RPA tools. This results in false warnings since the IR and RPA tools typically consider these changes as expected and retrieve dynamic data from them. Moreover, these monitoring tools cannot identify GUI widget evolution (e.g., moving a button), and thus cannot help the IR and RPA tools adapt to the evolved widgets (e.g., automatic repair of locators for the evolved widgets). To address the limitations of the existing monitoring tools, we propose an approach, WebEvo, that leverages historic pages to identify the DOM elements whose changes are content-based changes, which can be safely ignored when reporting changes in the new web pages. Furthermore, to identify refactoring changes that preserve semantics and appearances of GUI widgets, WebEvo adapts computer vision (CV) techniques to identify the mappings of the GUI widgets from the old web page to the new web page on an element-by-element basis. Empirical evaluations on 13 real-world websites from 9 popular categories demonstrate the superiority of WebEvo over the existing DOM-tree based detection or whole-page visual comparison in terms of both effectiveness and efficiency.

Modular call graph construction for security scanning of Node.js applications

Benjamin Barslev Nielsen
Martin Toldam Torp
Anders Møller

Most of the code in typical Node.js applications comes from third-party libraries that consist of a large number of interdependent modules. Because of the dynamic features of JavaScript, it is difficult to obtain detailed information about the module dependencies, which is vital for reasoning about the potential consequences of security vulnerabilities in libraries, and for many other software development tasks. The underlying challenge is how to construct precise call graphs that capture the connectivity between functions in the modules.

In this work we present a novel approach to call graph construction for Node.js applications that is modular, taking into account the modular structure of Node.js applications, and sufficiently accurate and efficient to be practically useful. We demonstrate experimentally that the constructed call graphs are useful for security scanning, reducing the number of false positives by 81% compared to npm audit and with zero false negatives. Compared to js-callgraph, the call graph construction is significantly more accurate and efficient. The experiments also show that the analysis time is reduced substantially when reusing modular call graphs.

SESSION: Testing Deep Learning Systems

Attack as defense: characterizing adversarial examples using robustness

Zhe Zhao
Guangke Chen
Jingyi Wang
Yiwei Yang
Fu Song
Jun Sun

As a new programming paradigm, deep learning has expanded its application to many real-world problems. At the same time, deep learning based software are found to be vulnerable to adversarial attacks. Though various defense mechanisms have been proposed to improve robustness of deep learning software, many of them are ineffective against adaptive attacks. In this work, we propose a novel characterization to distinguish adversarial examples from benign ones based on the observation that adversarial examples are significantly less robust than benign ones. As existing robustness measurement does not scale to large networks, we propose a novel defense framework, named attack as defense (A²D), to detect adversarial examples by effectively evaluating an example’s robustness. A²D uses the cost of attacking an input for robustness evaluation and identifies those less robust examples as adversarial since less robust examples are easier to attack. Extensive experiment results on MNIST, CIFAR10 and ImageNet show that A²D is more effective than recent promising approaches. We also evaluate our defense against potential adaptive attacks and show that A²D is effective in defending carefully designed adaptive attacks, e.g., the attack success rate drops to 0% on CIFAR10.

Exposing previously undetectable faults in deep neural networks

Isaac Dunn
Hadrien Pouget
Daniel Kroening
Tom Melham

Existing methods for testing DNNs solve the oracle problem by constraining the raw features (e.g. image pixel values) to be within a small distance of a dataset example for which the desired DNN output is known. But this limits the kinds of faults these approaches are able to detect. In this paper, we introduce a novel DNN testing method that is able to find faults in DNNs that other methods cannot. The crux is that, by leveraging generative machine learning, we can generate fresh test inputs that vary in their high-level features (for images, these include object shape, location, texture, and colour). We demonstrate that our approach is capable of detecting deliberately injected faults as well as new faults in state-of-the-art DNNs, and that in both cases, existing methods are unable to find these faults.

DeepCrime: mutation testing of deep learning systems based on real faults

Nargiz Humbatova
Gunel Jahangirova
Paolo Tonella

Deep Learning (DL) solutions are increasingly adopted, but how to test them remains a major open research problem. Existing and new testing techniques have been proposed for and adapted to DL systems, including mutation testing. However, no approach has investigated the possibility to simulate the effects of real DL faults by means of mutation operators. We have defined 35 DL mutation operators relying on 3 empirical studies about real faults in DL systems. We followed a systematic process to extract the mutation operators from the existing fault taxonomies, with a formal phase of conflict resolution in case of disagreement. We have implemented 24 of these DL mutation operators into DeepCrime, the first source-level pre-training mutation tool based on real DL faults. We have assessed our mutation operators to understand their characteristics: whether they produce interesting, i.e., killable but not trivial, mutations. Then, we have compared the sensitivity of our tool to the changes in the quality of test data with that of DeepMutation++, an existing post-training DL mutation tool.

DeepHyperion: exploring the feature space of deep learning-based systems through illumination search

Tahereh Zohdinasab
Vincenzo Riccio
Alessio Gambi
Paolo Tonella

Deep Learning (DL) has been successfully applied to a wide range of application domains, including safety-critical ones. Several DL testing approaches have been recently proposed in the literature but none of them aims to assess how different interpretable features of the generated inputs affect the system's behaviour.

In this paper, we resort to Illumination Search to find the highest-performing test cases (i.e., misbehaving and closest to misbehaving), spread across the cells of a map representing the feature space of the system. We introduce a methodology that guides the users of our approach in the tasks of identifying and quantifying the dimensions of the feature space for a given domain. We developed DeepHyperion, a search-based tool for DL systems that illuminates, i.e., explores at large, the feature space, by providing developers with an interpretable feature map where automatically generated inputs are placed along with information about the exposed behaviours.

Automatic test suite generation for key-points detection DNNs using many-objective search (experience paper)

Fitash Ul Haq
Donghwan Shin
Lionel C. Briand
Thomas Stifter
Jun Wang

Automatically detecting the positions of key-points (e.g., facial key-points or finger key-points) in an image is an essential problem in many applications, such as driver's gaze detection and drowsiness detection in automated driving systems. With the recent advances of Deep Neural Networks (DNNs), Key-Points detection DNNs (KP-DNNs) have been increasingly employed for that purpose. Nevertheless, KP-DNN testing and validation have remained a challenging problem because KP-DNNs predict many independent key-points at the same time---where each individual key-point may be critical in the targeted application---and images can vary a great deal according to many factors.

In this paper, we present an approach to automatically generate test data for KP-DNNs using many-objective search. In our experiments, focused on facial key-points detection DNNs developed for an industrial automotive application, we show that our approach can generate test suites to severely mispredict, on average, more than 93% of all key-points. In comparison, random search-based test data generation can only severely mispredict 41% of them. Many of these mispredictions, however, are not avoidable and should not therefore be considered failures. We also empirically compare state-of-the-art, many-objective search algorithms and their variants, tailored for test suite generation. Furthermore, we investigate and demonstrate how to learn specific conditions, based on image characteristics (e.g., head posture and skin color), that lead to severe mispredictions. Such conditions serve as a basis for risk analysis or DNN retraining.

Efficient white-box fairness testing through gradient search

Lingfeng Zhang
Yueling Zhang
Min Zhang

Deep learning (DL) systems are increasingly deployed for autonomous decision-making in a wide range of applications. Apart from the robustness and safety, fairness is also an important property that a well-designed DL system should have. To evaluate and improve individual fairness of a model, systematic test case generation for identifying individual discriminatory instances in the input space is essential. In this paper, we propose a framework EIDIG for efficiently discovering individual fairness violation. Our technique combines a global generation phase for rapidly generating a set of diverse discriminatory seeds with a local generation phase for generating as many individual discriminatory instances as possible around these seeds under the guidance of the gradient of the model output. In each phase, prior information at successive iterations is fully exploited to accelerate convergence of iterative optimization or reduce frequency of gradient calculation. Our experimental results show that, on average, our approach EIDIG generates 19.11% more individual discriminatory instances with a speedup of 121.49% when compared with the state-of-the-art method and mitigates individual discrimination by 80.03% with a limited accuracy loss after retraining.

DialTest: automated testing for recurrent-neural-network-driven dialogue systems

Zixi Liu
Yang Feng
Zhenyu Chen

With the tremendous advancement of recurrent neural networks(RNN), dialogue systems have achieved significant development. Many RNN-driven dialogue systems, such as Siri, Google Home, and Alexa, have been deployed to assist various tasks. However, accompanying this outstanding performance, RNN-driven dialogue systems, which are essentially a kind of software, could also produce erroneous behaviors and result in massive losses. Meanwhile, the complexity and intractability of RNN models that power the dialogue systems make their testing challenging. In this paper, we design and implement DialTest, the first RNN-driven dialogue system testing tool. DialTest employs a series of transformation operators to make realistic changes on seed data while preserving their oracle information properly. To improve the efficiency of detecting faults, DialTest further adopts Gini impurity to guide the test generation process. We conduct extensive experiments to validate DialTest. We first experiment it on two fundamental tasks, i.e., intent detection and slot filling, of natural language understanding. The experiment results show that DialTest can effectively detect hundreds of erroneous behaviors for different RNN-driven natural language understanding (NLU) modules of dialogue systems and improve their accuracy via retraining with the generated data. Further, we conduct a case study on an industrial dialogue system to investigate the performance of DialTest under the real usage scenario. The study shows DialTest can detect errors and improve the robustness of RNN-driven dialogue systems effectively.

AdvDoor: adversarial backdoor attack of deep learning system

Quan Zhang
Yifeng Ding
Yongqiang Tian
Jianmin Guo
Min Yuan
Yu Jiang

Deep Learning (DL) system has been widely used in many critical applications, such as autonomous vehicles and unmanned aerial vehicles. However, their security is threatened by backdoor attack, which is achieved by adding artificial patterns on specific training data. Existing attack methods normally poison the data using a patch, and they can be easily detected by existing detection methods. In this work, we propose the Adversarial Backdoor, which utilizes the Targeted Universal Adversarial Perturbation (TUAP) to hide the anomalies in DL models and confuse existing powerful detection methods. With extensive experiments, it is demonstrated that Adversarial Backdoor can be injected stably with an attack success rate around 98%. Moreover, Adversarial Backdoor can bypass state-of-the-art backdoor detection methods. More specifically, only around 37% of the poisoned models can be caught, and less than 29% of the poisoned data cannot bypass the detection. In contrast, for the patch backdoor, all the poisoned models and more than 80% of the poisoned data will be detected. This work intends to alarm the researchers and developers of this potential threat and to inspire the designing of effective detection methods.

ModelDiff: testing-based DNN similarity comparison for model reuse detection

Yuanchun Li
Ziqi Zhang
Bingyan Liu
Ziyue Yang
Yunxin Liu

The knowledge of a deep learning model may be transferred to a student model, leading to intellectual property infringement or vulnerability propagation. Detecting such knowledge reuse is nontrivial because the suspect models may not be white-box accessible and/or may serve different tasks. In this paper, we propose ModelDiff, a testing-based approach to deep learning model similarity comparison. Instead of directly comparing the weights, activations, or outputs of two models, we compare their behavioral patterns on the same set of test inputs. Specifically, the behavioral pattern of a model is represented as a decision distance vector (DDV), in which each element is the distance between the model's reactions to a pair of inputs. The knowledge similarity between two models is measured with the cosine similarity between their DDVs. To evaluate ModelDiff, we created a benchmark that contains 144 pairs of models that cover most popular model reuse methods, including transfer learning, model compression, and model stealing. Our method achieved 91.7% correctness on the benchmark, which demonstrates the effectiveness of using ModelDiff for model reuse detection. A study on mobile deep learning apps has shown the feasibility of ModelDiff on real-world models.

SESSION: Android Application Analysis and Testing

Parema: an unpacking framework for demystifying VM-based Android packers

Lei Xue
Yuxiao Yan
Luyi Yan
Muhui Jiang
Xiapu Luo
Dinghao Wu
Yajin Zhou

Android packers have been widely adopted by developers to protect apps from being plagiarized. Meanwhile, various unpacking tools unpack the apps through direct memory dumping. To defend against these off-the-shelf unpacking tools, packers start to adopt virtual machine (VM) based protection techniques, which replace the original Dalvik bytecode (DCode) with customized bytecode (PCode) in memory. This defeats the unpackers using memory dumping mechanisms. However, little is known about whether such packers can provide enough protection to Android apps. In this paper, we aim to shed light on these questions and take the first step towards demystifying the protections provided to the apps by the VM-based packers. We proposed novel program analysis techniques to investigate existing commercial VM-based packers including a learning phase and a deobfuscation phase.We aim at deobfuscating the VM-protection DCode in three scenarios, recovering original DCode or its semantics with training apps, and restoring the semantics without training apps. We also develop a prototype named Parema to automate much work of the deobfuscation procedure. By applying it to the online VM-based Android packers, we reveal that all evaluated packers do not provide adequate protection and could be compromised.

An infrastructure approach to improving effectiveness of Android UI testing tools

Wenyu Wang
Wing Lam
Tao Xie

Due to the importance of Android app quality assurance, many Android UI testing tools have been developed by researchers over the years. However, recent studies show that these tools typically achieve low code coverage on popular industrial apps. In fact, given a reasonable amount of run time, most state-of-the-art tools cannot even outperform a simple tool, Monkey, on popular industrial apps with large codebases and sophisticated functionalities. Our motivating study finds that these tools perform two types of operations, UI Hierarchy Capturing (capturing information about the contents on the screen) and UI Event Execution (executing UI events, such as clicks), often inefficiently using UIAutomator, a component of the Android framework. In total, these two types of operations use on average 70% of the given test time.

Based on this finding, to improve the effectiveness of Android testing tools, we propose TOLLER, a tool consisting of infrastructure enhancements to the Android operating system. TOLLER injects itself into the same virtual machine as the app under test, giving TOLLER direct access to the app’s runtime memory. TOLLER is thus able to directly (1) access UI data structures, and thus capture contents on the screen without the overhead of invoking the Android framework services or remote procedure calls (RPCs), and (2) invoke UI event handlers without needing to execute the UI events. Compared with the often-used UIAutomator, TOLLER reduces average time usage of UI Hierarchy Capturing and UI Event Execution operations by up to 97% and 95%, respectively. We integrate TOLLER with existing state-of-the-art/practice Android UI testing tools and achieve the range of 11.8% to 70.1% relative code coverage improvement on average. We also find that TOLLER-enhanced tools are able to trigger 1.4x to 3.6x distinct crashes compared with their original versions without TOLLER enhancement. These improvements are so substantial that they also change the relative competitiveness of the tools under empirical comparison. Our findings highlight the practicality of TOLLER as well as raising the community awareness of infrastructure support’s significance beyond the community’s existing heavy focus on algorithms.

Semantic matching of GUI events for test reuse: are we there yet?

Leonardo Mariani
Ali Mohebbi
Mauro Pezzè
Valerio Terragni

GUI testing is an important but expensive activity. Recently, research on test reuse approaches for Android applications produced interesting results. Test reuse approaches automatically migrate human-designed GUI tests from a source app to a target app that shares similar functionalities. They achieve this by exploiting semantic similarity among textual information of GUI widgets. Semantic matching of GUI events plays a crucial role in these approaches. In this paper, we present the first empirical study on semantic matching of GUI events. Our study involves 253 configurations of the semantic matching, 337 unique queries, and 8,099 distinct GUI events. We report several key findings that indicate how to improve semantic matching of test reuse approaches, propose SemFinder a novel semantic matching algorithm that outperforms existing solutions, and identify several interesting research directions.

GUIDER: GUI structure and vision co-guided test script repair for Android apps

Tongtong Xu
Minxue Pan
Yu Pei
Guiyin Li
Xia Zeng
Tian Zhang
Yuetang Deng
Xuandong Li

GUI testing is an essential part of regression testing for Android apps. For regression GUI testing to remain effective, it is important that obsolete GUI test scripts get repaired after the app has evolved. In this paper, we propose a novel approach named GUIDER to automated repair of GUI test scripts for Android apps. The key novelty of the approach lies in the utilization of both structural and visual information of widgets on app GUIs to better understand what widgets of the base version app become in the updated version. A supporting tool has been implemented for the approach. Experiments conducted on the popular messaging and social media app WeChat show that GUIDER is both effective and efficient. Repairs produced by GUIDER enabled 88.8% and 54.9% more test actions to run correctly than those produced by existing approaches to GUI test repair that rely solely on visual or structural information of app GUIs.

Understanding and finding system setting-related defects in Android apps

Jingling Sun
Ting Su
Junxin Li
Zhen Dong
Geguang Pu
Tao Xie
Zhendong Su

Android, the most popular mobile system, offers a number of user-configurable system settings (e.g., network, location, and permission) for controlling devices and apps. Even popular, well-tested apps may fail to properly adapt their behaviors to diverse setting changes, thus frustrating their users. However, there exists no effort to systematically investigate such defects. To this end, we conduct the first empirical study to understand the characteristics of these setting-related defects (in short as "setting defects"), which reside in apps and are triggered by system setting changes. We devote substantial manual effort (over three person-months) to analyze 1,074 setting defects from 180 popular apps on GitHub. We investigate their impact, root causes, and consequences. We find that setting defects have a wide, diverse impact on apps' correctness, and the majority of these defects (≈70.7%) cause non-crash (logic) failures, and thus could not be automatically detected by existing app testing techniques due to the lack of strong test oracles. Motivated and guided by our study, we propose setting-wise metamorphic fuzzing, the first automated testing approach to effectively detect setting defects without explicit oracles. Our key insight is that an app's behavior should, in most cases, remain consistent if a given setting is changed and later properly restored, or exhibit expected differences if not restored. We realize our approach in SetDroid, an automated, end-to-end GUI testing tool, for detecting both crash and non-crash setting defects. SetDroid has been evaluated on 26 popular, open-source apps and detected 42 unique, previously unknown setting defects in 24 apps. To date, 33 have been confirmed and 21 fixed. We also apply SetDroid on five highly popular industrial apps, namely WeChat, QQMail, TikTok, CapCut, and AlipayHK, all of which each have billions of monthly active users. SetDroid successfully detects 17 previously unknown setting defects in these apps' latest releases, and all defects have been confirmed and fixed by the app vendors. The majority of SetDroid-detected defects (49 out of 59) cause non-crash failures, which could not be detected by existing testing tools (as our evaluation confirms). These results demonstrate SetDroid's strong effectiveness and practicality.

HomDroid: detecting Android covert malware by social-network homophily analysis

Yueming Wu
Deqing Zou
Wei Yang
Xiang Li
Hai Jin

Android has become the most popular mobile operating system. Correspondingly, an increasing number of Android malware has been developed and spread to steal users’ private information. There exists one type of malware whose benign behaviors are developed to camouflage malicious behaviors. The malicious component occupies a small part of the entire code of the application (app for short), and the malicious part is strongly coupled with the benign part. In this case, the malware may cause false negatives when malware detectors extract features from the entire apps to conduct classification because the malicious features of these apps may be hidden among benign features. Moreover, some previous work aims to divide the entire app into several parts to discover the malicious part. However, the premise of these methods to commence app partition is that the connections between the normal part and the malicious part are weak (repackaged malware).

In this paper, we call this type of malware as Android covert malware and generate the first dataset of covert malware. To detect covert malware samples, we first conduct static analysis to extract the function call graphs. Through the deep analysis on call graphs, we observe that although the correlations between the normal part and the malicious part in these graphs are high, the degree of these correlations has a unique range of distribution. Based on the observation, we design a novel system, HomDroid, to detect covert malware by analyzing the homophily of call graphs. We identify the ideal threshold of correlation to distinguish the normal part and the malicious part based on the evaluation results on a dataset of 4,840 benign apps and 3,385 covert malicious apps. According to our evaluation results, HomDroid is capable of detecting 96.8% of covert malware while the False Negative Rates of another four state-of-the-art systems (PerDroid, Drebin, MaMaDroid, and IntDroid) are 30.7%, 16.3%, 15.2%, and 10.4%, respectively.

SESSION: Fuzzing

Seed selection for successful fuzzing

Adrian Herrera
Hendra Gunadi
Shane Magrath
Michael Norrish
Mathias Payer
Antony L. Hosking

Mutation-based greybox fuzzing---unquestionably the most widely-used fuzzing technique---relies on a set of non-crashing seed inputs (a corpus) to bootstrap the bug-finding process. When evaluating a fuzzer, common approaches for constructing this corpus include: (i) using an empty file; (ii) using a single seed representative of the target's input format; or (iii) collecting a large number of seeds (e.g., by crawling the Internet). Little thought is given to how this seed choice affects the fuzzing process, and there is no consensus on which approach is best (or even if a best approach exists).

To address this gap in knowledge, we systematically investigate and evaluate how seed selection affects a fuzzer's ability to find bugs in real-world software. This includes a systematic review of seed selection practices used in both evaluation and deployment contexts, and a large-scale empirical evaluation (over 33 CPU-years) of six seed selection approaches. These six seed selection approaches include three corpus minimization techniques (which select the smallest subset of seeds that trigger the same range of instrumentation data points as a full corpus).

Our results demonstrate that fuzzing outcomes vary significantly depending on the initial seeds used to bootstrap the fuzzer, with minimized corpora outperforming singleton, empty, and large (in the order of thousands of files) seed sets. Consequently, we encourage seed selection to be foremost in mind when evaluating/deploying fuzzers, and recommend that (a) seed choice be carefully considered and explicitly documented, and (b) never to evaluate fuzzers with only a single seed.

Gramatron: effective grammar-aware fuzzing

Prashast Srivastava
Mathias Payer

Fuzzers aware of the input grammar can explore deeper program states using grammar-aware mutations. Existing grammar-aware fuzzers are ineffective at synthesizing complex bug triggers due to: (i) grammars introducing a sampling bias during input generation due to their structure, and (ii) the current mutation operators for parse trees performing localized small-scale changes. Gramatron uses grammar automatons in conjunction with aggressive mutation operators to synthesize complex bug triggers faster. We build grammar automatons to address the sampling bias. It restructures the grammar to allow for unbiased sampling from the input state space. We redesign grammar-aware mutation operators to be more aggressive, i.e., perform large-scale changes. Gramatron can consistently generate complex bug triggers in an efficient manner as compared to using conventional grammars with parse trees. Inputs generated from scratch by Gramatron have higher diversity as they achieve up to 24.2% more coverage relative to existing fuzzers. Gramatron makes input generation 98% faster and the input representations are 24% smaller. Our redesigned mutation operators are 6.4× more aggressive while still being 68% faster at performing these mutations. We evaluate Gramatron across three interpreters with 10 known bugs consisting of three complex bug triggers and seven simple bug triggers against two Nautilus variants. Gramatron finds all the complex bug triggers reliably and faster. For the simple bug triggers, Gramatron outperforms Nautilus four out of seven times. To demonstrate Gramatron’s effectiveness in the wild, we deployed Gramatron on three popular interpreters for a 10-day fuzzing campaign where it discovered 10 new vulnerabilities.

QFuzz: quantitative fuzzing for side channels

Yannic Noller
Saeid Tizpaz-Niari

Side channels pose a significant threat to the confidentiality of software systems. Such vulnerabilities are challenging to detect and evaluate because they arise from non-functional properties of software such as execution times and require reasoning on multiple execution traces. Recently, noninterference notions have been adapted in static analysis, symbolic execution, and greybox fuzzing techniques. However, noninterference is a strict notion and may reject security even if the strength of information leaks are weak. A quantitative notion of security allows for the relaxation of noninterference and tolerates small (unavoidable) leaks. Despite progress in recent years, the existing quantitative approaches have scalability limitations in practice.

In this work, we present QFuzz, a greybox fuzzing technique to quantitatively evaluate the strength of side channels with a focus on min entropy. Min entropy is a measure based on the number of distinguishable observations (partitions) to assess the resulting threat from an attacker who tries to compromise secrets in one try. We develop a novel greybox fuzzing equipped with two partitioning algorithms that try to maximize the number of distinguishable observations and the cost differences between them.

We evaluate QFuzz on a large set of benchmarks from existing work and real-world libraries (with a total of 70 subjects). QFuzz compares favorably to three state-of-the-art detection techniques. QFuzz provides quantitative information about leaks beyond the capabilities of all three techniques. Crucially, we compare QFuzz to a state-of-the-art quantification tool and find that QFuzz significantly outperforms the tool in scalability while maintaining similar precision. Overall, we find that our approach scales well for real-world applications and provides useful information to evaluate resulting threats. Additionally, QFuzz identifies a zero-day side-channel vulnerability in a security critical Java library that has since been confirmed and fixed by the developers.

SESSION: Data Processing Application Analysis

SAND: a static analysis approach for detecting SQL antipatterns

Yingjun Lyu
Sasha Volokh
William G. J. Halfond
Omer Tripp

Local databases underpin important features in many mobile applications, such as responsiveness in the face of poor connectivity. However, failure to use such databases correctly can lead to high resource consumption or even security vulnerabilities. We present SAND, an extensible static analysis approach that checks for misuse of local databases, also known as SQL antipatterns, in mobile apps. SAND features novel abstractions for common forms of application/database interactions, which enables concise and precise specification of the antipatterns that SAND checks for. To validate the efficacy of SAND, we have experimented with a diverse suite of 1,000 Android apps. We show that the abstractions that power SAND allow concise specification of all the known antipatterns from the literature (12-74 LOC), and that the antipatterns are modeled accurately (99.4-100% precision). As for performance, SAND requires on average 41 seconds to complete a scan on a mobile app.

Semantic table structure identification in spreadsheets

Yakun Zhang
Xiao Lv
Haoyu Dong
Wensheng Dou
Shi Han
Dongmei Zhang
Jun Wei
Dan Ye

Spreadsheets are widely used in various business tasks, and contain amounts of valuable data. However, spreadsheet tables are usually organized in a semi-structured way, and contain complicated semantic structures, e.g., header types and relations among headers. Lack of documented semantic table structures, existing data analysis and error detection tools can hardly understand spreadsheet tables. Therefore, identifying semantic table structures in spreadsheet tables is of great importance, and can greatly promote various analysis tasks on spreadsheets.

In this paper, we propose Tasi (Table structure identification) to automatically identify semantic table structures in spreadsheets. Based on the contents, styles, and spatial locations in table headers, Tasi adopts a multi-classifier to predict potential header types and relations, and then integrates all header types and relations into consistent semantic table structures. We further propose TasiError, to detect spreadsheet errors based on the identified semantic table structures by Tasi. Our experiments on real-world spreadsheets show that, Tasi can precisely identify semantic table structures in spreadsheets, and TasiError can detect real-world spreadsheet errors with higher precision (75.2%) and recall (82.9%) than existing approaches.

SESSION: Dynamic Analysis

Runtime detection of memory errors with smart status

Zhe Chen
Chong Wang
Junqi Yan
Yulei Sui
Jingling Xue

C is a dominant language for implementing system software. Unfortunately, its support for low-level control of memory often leads to memory errors. Dynamic analysis tools, which have been widely used for detecting memory errors at runtime, are not yet satisfactory as they cannot deterministically and completely detect some types of memory errors, e.g., segment confusion errors, sub-object overflows, use-after-frees, and memory leaks.

We propose Smatus, short for smart status, a new dynamic analysis approach that supports comprehensive runtime detection of memory errors. The key innovation is to create and maintain a small status node for each memory object. Our approach tracks not only the bounds of each pointer’s referent but also the status and reference count of the referent in its status node, where the status represents the liveness and segment type of the referent. A status node is smart as it is automatically destroyed when it becomes useless. To the best of our knowledge, Smatus represents the most comprehensive approach of its kind. In terms of effectiveness (for detecting more kinds of errors), Smatus outperforms state-of-the-art tools, Google’s AddressSanitizer, SoftBoundCETS and Valgrind. In terms of performance, Smatus outperforms SoftBoundCETS and Valgrind in terms of both time and memory overheads incurred, and is on par with AddressSanitizer in terms of the time and memory overheads tradeoff (with much lower memory overhead incurred).

UAFSan: an object-identifier-based dynamic approach for detecting use-after-free vulnerabilities

Binfa Gui
Wei Song
Jeff Huang

Use-After-Free (UAF) vulnerabilities constitute severe threats to software security. In contrast to other memory errors, UAFs are more difficult to detect through manual or static analysis due to pointer aliases and complicated relationships between pointers and objects. Existing evidence-based dynamic detection approaches only track either pointers or objects to record the availability of objects, which become invalid when the memory that stored the freed object is reallocated. To this end, we propose an approach UAFSan dedicated to comprehensively detecting UAFs at runtime. Specifically, we assign a unique identifier to each newly-allocated object and its pointers; when a pointer dereferences a memory object, we determine whether a UAF occurs by checking the consistency of their identifiers. We implement UAFSan in an open-source tool and evaluate it on a large collection of popular benchmarks and real-world programs. The experiment results demonstrate that UAFSan successfully detect all UAFs with reasonable overhead, whereas existing publicly-available dynamic detectors all miss certain UAFs.

SESSION: Symbolic Execution

Fuzzing SMT solvers via two-dimensional input space exploration

Peisen Yao
Heqing Huang
Wensheng Tang
Qingkai Shi
Rongxin Wu
Charles Zhang

Satisfiability Modulo Theories (SMT) solvers serve as the core engine of many techniques, such as symbolic execution. Therefore, ensuring the robustness and correctness of SMT solvers is critical. While fuzzing is an efficient and effective method for validating the quality of SMT solvers, we observe that prior fuzzing work only focused on generating various first-order formulas as the inputs but neglected the algorithmic configuration space of an SMT solver, which leads to under-reporting many deeply-hidden bugs. In this paper, we present Falcon, a fuzzing technique that explores both the formula space and the configuration space. Combining the two spaces significantly enlarges the search space and makes it challenging to detect bugs efficiently. We solve this problem by utilizing the correlations between the two spaces to reduce the search space, and introducing an adaptive mutation strategy to boost the search efficiency. During six months of extensive testing, Falcon finds 518 confirmed bugs in CVC4 and Z3, two state-of-the-art SMT solvers, 469 of which have already been fixed. Compared to two state-of-the-art fuzzers, Falcon detects 38 and 44 more bugs and improves the coverage by a large margin in 24 hours of testing.

Boosting symbolic execution via constraint solving time prediction (experience paper)

Sicheng Luo
Hui Xu
Yanxiang Bi
Xin Wang
Yangfan Zhou

Symbolic execution is an essential approach for automated test case generation. However, the approach is generally not scalable to large programs. One critical reason is that the constraint solving problems in symbolic execution are generally hard. Consequently, the symbolic execution process may get stuck in solving such hard problems. To mitigate this issue, symbolic execution tools generally rely on a timeout threshold to terminate the solving. Such a timeout is generally set to a fixed, predefined value, e.g., five minutes in angr. Nevertheless, how to set a proper timeout is critical to the tool’s efficiency. This paper proposes an approach to tackle the problem by predicting the time required for solving a constraint model so that the symbolic execution engine could base on the information to determine whether to continue the current solving process. Due to the cost of the prediction itself, our approach triggers the predictor only when the solving time has exceeded a relatively small value. We have shown that such a predictor can achieve promising performance with several different machine learning models and datasets. By further employing an adaptive design, the predictor can achieve an F1-score ranging from 0.743 to 0.800 on these datasets. We then apply the predictor to eight programs and conduct simulation experiments. Results show that the efficiency of constraint solving for symbolic execution can be improved by 1.25x to 3x, depending on the distribution of the hardness of their constraint models.

Synthesize solving strategy for symbolic execution

Zhenbang Chen
Zehua Chen
Ziqi Shuai
Guofeng Zhang
Weiyu Pan
Yufeng Zhang
Ji Wang

Symbolic execution is powered by constraint solving. The advancement of constraint solving boosts the development and the applications of symbolic execution. Modern SMT solvers provide the mechanism of solving strategy that allows the users to control the solving procedure, which significantly improves the solver's generalization ability. We observe that the symbolic executions of different programs are actually different constraint solving problems. Therefore, we propose synthesizing a solving strategy for a program to fit the program's symbolic execution best. To achieve this, we divide symbolic execution into two stages. The SMT formulas solved in the first stage are used to online synthesize a solving strategy, which is then employed during the constraint solving in the second stage. We propose novel synthesis algorithms that combine offline trained deep learning models and online tuning to synthesize the solving strategy. The algorithms balance the synthesis overhead and the improvement achieved by the synthesized solving strategy.

We have implemented our method on the state-of-the-art symbolic execution engine KLEE for C programs. The results of the extensive experiments indicate that our method effectively improves the efficiency of symbolic execution. On average, our method increases the numbers of queries and paths by 58.76% and 66.11%, respectively. Besides, we applied our method to a Java Pathfinder-based concolic execution engine to validate the generalization ability. The results indicate that our method has a good generalization ability and increases the numbers of queries and paths by 100.24% and 102.6% for the benchmark Java programs, respectively.

Type and interval aware array constraint solving for symbolic execution

Ziqi Shuai
Zhenbang Chen
Yufeng Zhang
Jun Sun
Ji Wang

Array constraints are prevalent in analyzing a program with symbolic execution. Solving array constraints is challenging due to the complexity of the precise encoding for arrays. In this work, we propose to synergize symbolic execution and array constraint solving. Our method addresses the difficulties in solving array constraints with novel ideas. First, we propose a lightweight method for pre-checking the unsatisfiability of array constraints based on integer linear programming. Second, observing that encoding arrays at the byte-level introduces many redundant axioms that reduce the effectiveness of constraint solving, we propose type and interval aware axiom generation. Note that the type information of array variables is inferred by symbolic execution, whereas interval information is calculated through the above pre-checking step. We have implemented our methods based on KLEE and its underlying constraint solver STP and conducted large-scale experiments on 75 real-world programs. The experimental results show that our method effectively improves the efficiency of symbolic execution. Our method solves 182.56% more constraints and explores 277.56% more paths on average under the same time threshold.

Grammar-agnostic symbolic execution by token symbolization

Weiyu Pan
Zhenbang Chen
Guofeng Zhang
Yunlai Luo
Yufeng Zhang
Ji Wang

Parsing code exists extensively in software. Symbolic execution of complex parsing programs is challenging. The inputs generated by the symbolic execution using the byte-level symbolization are usually rejected by the parsing program, which dooms the effectiveness and efficiency of symbolic execution. Complex parsing programs usually adopt token-based input grammar checking. A token sequence represents one case of the input grammar. Based on this observation, we propose grammar-agnostic symbolic execution that can automatically generate token sequences to test complex parsing programs effectively and efficiently. Our method's key idea is to symbolize tokens instead of input bytes to improve the efficiency of symbolic execution. Technically, we propose a novel two-stage algorithm: the first stage collects the byte-level constraints of token values; the second stage employs token symbolization and the constraints collected in the first stage to generate the program inputs that are more possible to pass the parsing code.

We have implemented our method on a Java Pathfinder (JPF) based concolic execution engine. The results of the extensive experiments on real-world Java parsing programs demonstrate the effectiveness and efficiency in testing complex parsing programs. Our method detects 6 unknown bugs in the benchmark programs and achieves orders of magnitude speedup to find the same bugs.

SESSION: Machine Learning and Testing

Interval constraint-based mutation testing of numerical specifications

Clothilde Jeangoudoux
Eva Darulova
Christoph Lauter

Mutation testing is an established approach for checking whether code satisfies a code-independent functional specification, and for evaluating whether a test set is adequate. Current mutation testing approaches, however, do not account for accuracy requirements that appear with numerical specifications implemented in floating- point arithmetic code, but which are a frequent part of safety-critical software. We present Magneto, an instantiation of mutation testing that fully automatically generates a test set from a real-valued specification. The generated tests check numerical code for accuracy, robustness and functional behavior bugs. Our technique is based on formulating test case and oracle generation as a constraint satisfaction problem over interval domains, which soundly bounds errors, but is nonetheless efficient. We evaluate Magneto on a standard floating-point benchmark set and find that it outperforms a random testing baseline for producing useful adequate test sets.

Predoo: precision testing of deep learning operators

Xufan Zhang
Ning Sun
Chunrong Fang
Jiawei Liu
Jia Liu
Dong Chai
Jiang Wang
Zhenyu Chen

Deep learning(DL) techniques attract people from various fields with superior performance in making progressive breakthroughs. To ensure the quality of DL techniques, researchers have been working on testing and verification approaches. Some recent studies reveal that the underlying DL operators could cause defects inside a DL model. DL operators work as fundamental components in DL libraries. Library developers still work on practical approaches to ensure the quality of operators they provide. However, the variety of DL operators and the implementation complexity make it challenging to evaluate their quality. Operator testing with limited test cases may fail to reveal hidden defects inside the implementation. Besides, the existing model-to-library testing approach requires extra labor and time cost to identify and locate errors, i.e., developers can only react to the exposed defects. This paper proposes a fuzzing-based operator-level precision testing approach to estimate individual DL operators' precision errors to bridge this gap. Unlike conventional fuzzing techniques, valid shape variable inputs and fine-grained precision error evaluation are implemented. The testing of DL operators is treated as a searching problem to maximize output precision errors. We implement our approach in a tool named Predoo and conduct an experiment on seven DL operators from TensorFlow. The experiment result shows that Predoo can trigger larger precision errors compared to the error threshold declared in the testing scripts from the TensorFlow repository.

TERA: optimizing stochastic regression tests in machine learning projects

Saikat Dutta
Jeeva Selvam
Aryaman Jain
Sasa Misailovic

The stochastic nature of many Machine Learning (ML) algorithms makes testing of ML tools and libraries challenging. ML algorithms allow a developer to control their accuracy and run-time through a set of hyper-parameters, which are typically manually selected in tests. This choice is often too conservative and leads to slow test executions, thereby increasing the cost of regression testing.

We propose TERA, the first automated technique for reducing the cost of regression testing in Machine Learning tools and libraries(jointly referred to as projects) without making the tests more flaky. TERA solves the problem of exploring the trade-off space between execution time of the test and its flakiness as an instance of Stochastic Optimization over the space of algorithm hyper-parameters. TERA presents how to leverage statistical convergence-testing techniques to estimate the level of flakiness of the test for a specific choice of hyper-parameters during optimization.

We evaluate TERA on a corpus of 160 tests selected from 15 popular machine learning projects. Overall, TERA obtains a geo-mean speedup of 2.23x over the original tests, for the minimum passing probability threshold of 99%. We also show that the new tests did not reduce fault detection ability through a mutation study and a study on a set of 12 historical build failures in studied projects.

Deep just-in-time defect prediction: how far are we?

Zhengran Zeng
Yuqun Zhang
Haotian Zhang
Lingming Zhang

Defect prediction aims to automatically identify potential defective code with minimal human intervention and has been widely studied in the literature. Just-in-Time (JIT) defect prediction focuses on program changes rather than whole programs, and has been widely adopted in continuous testing. CC2Vec, state-of-the-art JIT defect prediction tool, first constructs a hierarchical attention network (HAN) to learn distributed vector representations of both code additions and deletions, and then concatenates them with two other embedding vectors representing commit messages and overall code changes extracted by the existing DeepJIT approach to train a model for predicting whether a given commit is defective. Although CC2Vec has been shown to be the state of the art for JIT defect prediction, it was only evaluated on a limited dataset and not compared with all representative baselines. Therefore, to further investigate the efficacy and limitations of CC2Vec, this paper performs an extensive study of CC2Vec on a large-scale dataset with over 310,370 changes (8.3 X larger than the original CC2Vec dataset). More specifically, we also empirically compare CC2Vec against DeepJIT and representative traditional JIT defect prediction techniques. The experimental results show that CC2Vec cannot consistently outperform DeepJIT, and neither of them can consistently outperform traditional JIT defect prediction. We also investigate the impact of individual traditional defect prediction features and find that the added-line-number feature outperforms other traditional features. Inspired by this finding, we construct a simplistic JIT defect prediction approach which simply adopts the added-line-number feature with the logistic regression classifier. Surprisingly, such a simplistic approach can outperform CC2Vec and DeepJIT in defect prediction, and can be 81k X/120k X faster in training/testing. Furthermore, the paper also provides various practical guidelines for advancing JIT defect prediction in the near future.

SESSION: Configuration and Dependency

Fixing dependency errors for Python build reproducibility

Suchita Mukherjee
Abigail Almanza
Cindy Rubio-González

Software reproducibility is important for re-usability and the cumulative progress of research. An important manifestation of unreproducible software is the changed outcome of software builds over time. While enhancing code reuse, the use of open-source dependency packages hosted on centralized repositories such as PyPI can have adverse effects on build reproducibility. Frequent updates to these packages often cause their latest versions to have breaking changes for applications using them. Large Python applications risk their historical builds becoming unreproducible due to the widespread usage of Python dependencies, and the lack of uniform practices for dependency version specification. Manually fixing dependency errors requires expensive developer time and effort, while automated approaches face challenges of parsing unstructured build logs, finding transitive dependencies, and exploring an exponential search space of dependency versions. In this paper, we investigate how open-source Python projects specify dependency versions, and how their reproducibility is impacted by dependency packages. We propose a tool PyDFix to detect and fix unreproducibility in Python builds caused by dependency errors. PyDFix is evaluated on two bug datasets BugSwarm and BugsInPy, both of which are built from real-world open-source projects. PyDFix analyzes a total of 2,702 builds, identifying 1,921 (71.1%) of them to be unreproducible due to dependency errors. From these, PyDFix provides a complete fix for 859 (44.7%) builds, and partial fixes for an additional 632 (32.9%) builds.

Test-case prioritization for configuration testing

Runxiang Cheng
Lingming Zhang
Darko Marinov
Tianyin Xu

Configuration changes are among the dominant causes of failures of large-scale software system deployment. Given the velocity of configuration changes, typically at the scale of hundreds to thousands of times daily in modern cloud systems, checking these configuration changes is critical to prevent failures due to misconfigurations. Recent work has proposed configuration testing, Ctest, a technique that tests configuration changes together with the code that uses the changed configurations. Ctest can automatically generate a large number of ctests that can effectively detect misconfigurations, including those that are hard to detect by traditional techniques. However, running ctests can take a long time to detect misconfigurations. Inspired by traditional test-case prioritization (TCP) that aims to reorder test executions to speed up detection of regression code faults, we propose to apply TCP to reorder ctests to speed up detection of misconfigurations. We extensively evaluate a total of 84 traditional and novel ctest-specific TCP techniques. The experimental results on five widely used cloud projects demonstrate that TCP can substantially speed up misconfiguration detection. Our study provides guidelines for applying TCP to configuration testing in practice.

The impact of tool configuration spaces on the evaluation of configurable taint analysis for Android

Austin Mordahl
Shiyi Wei

The most popular static taint analysis tools for Android allow users to change the underlying analysis algorithms through configuration options. However, the large configuration spaces make it difficult for developers and users alike to understand the full capabilities of these tools, and studies to-date have only focused on individual configurations. In this work, we present the first study that evaluates the configurations in Android taint analysis tools, focusing on the two most popular tools, FlowDroid and DroidSafe. First, we perform a manual code investigation to better understand how configurations are implemented in both tools. We formalize the expected effects of configuration option settings in terms of precision and soundness partial orders which we use to systematically test the configuration space. Second, we create a new dataset of 756 manually classified flows across 18 open-source real-world apps and conduct large-scale experiments on this dataset and micro-benchmarks. We observe that configurations make significant tradeoffs on the performance, precision, and soundness of both tools. The studies to-date would reach different conclusions on the tools' capabilities were they to consider configurations or use real-world datasets. In addition, we study the individual options through a statistical analysis and make actionable recommendations for users to tune the tools to their own ends. Finally, we use the partial orders to test the tool configuration spaces and detect 21 instances where options behaved in unexpected and incorrect ways, demonstrating the need for rigorous testing of configuration spaces.

Challenges and opportunities: an in-depth empirical study on configuration error injection testing

Wang Li
Zhouyang Jia
Shanshan Li
Yuanliang Zhang
Teng Wang
Erci Xu
Ji Wang
Xiangke Liao

Configuration error injection testing (CEIT) could systematically evaluate software reliability and diagnosability to runtime configuration errors. This paper explores the challenges and opportunities of applying CEIT technique. We build an extensible, highly-modularized CEIT framework named CeitInspector to experiment with various CEIT techniques. Using CeitInspector, we quantitatively measure the effectiveness and efficiency of CEIT using six mature and widely-used server applications. During this process, we find a fair number of test cases are left unstudied by the prior research work. The injected configuration errors in these cases often indicate latent misconfigurations, which might be ticking time bombs in the system and lead to severe damage. We conduct an in-depth study regarding these cases to reveal the root causes, and explore possible remedies. Finally, we come up with actionable suggestions guided by our study to improve the effectiveness and efficiency of the existing CEIT techniques.

SESSION: Testing

Empirically evaluating readily available information for regression test optimization in continuous integration

Daniel Elsner
Florian Hauer
Alexander Pretschner
Silke Reimer

Regression test selection (RTS) and prioritization (RTP) techniques aim to reduce testing efforts and developer feedback time after a change to the code base. Using various information sources, including test traces, build dependencies, version control data, and test histories, they have been shown to be effective. However, not all of these sources are guaranteed to be available and accessible for arbitrary continuous integration (CI) environments. In contrast, metadata from version control systems (VCSs) and CI systems are readily available and inexpensive. Yet, corresponding RTP and RTS techniques are scattered across research and often only evaluated on synthetic faults or in a specific industrial context. It is cumbersome for practitioners to identify insights that apply to their context, let alone to calibrate associated parameters for maximum cost-effectiveness. This paper consolidates existing work on RTP and unsafe RTS into an actionable methodology to build and evaluate such approaches that exclusively rely on CI and VCS metadata. To investigate how these approaches from prior research compare in heterogeneous settings, we apply the methodology in a large-scale empirical study on a set of 23 projects covering 37,000 CI logs and 76,000 VCS commits. We find that these approaches significantly outperform established RTP baselines and, while still triggering 90% of the failures, we show that practitioners can expect to save on average 84% of test execution time for unsafe RTS. We also find that it can be beneficial to limit training data, features from test history work better than change-based features, and, somewhat surprisingly, simple and well-known heuristics often outperform complex machine-learned models.

Toward optimal mc/dc test case generation

Sangharatna Godboley
Joxan Jaffar
Rasool Maghareh
Arpita Dutta

MC/DC coverage prescribes a set of MC/DC sequences. Such a sequence is defined by a specification of the truth values of certain atomic boolean expressions which appear in predicates (i.e. boolean combinations of atomic boolean expressions) in the program. An execution trace satisfies the sequence if it realizes the atomic boolean conditions in accordance with the truth value specification of the sequence. An MC/DC sequence is feasible if there is one such execution trace. The overall goal for an MC/DC test generator is, for each sequence: if feasible, to generate a test input realizing the sequence; otherwise, to prove that the sequence is infeasible.

In this paper, we propose a method whose aim is optimal MC/DC coverage for bounded programs, i.e. for each MC/DC sequence, the method either produces a test input, or proves that sequence is infeasible. The method is based on symbolic execution with interpolation, and in this paper, we present a customized interpolation algorithm. We then present a comprehensive experimental evaluation comparing with the only available system CBMC which can operate on reasonably large programs, and further, which can provide optimal coverage for many examples. We will use a benchmark based on RERS which contains the kinds of reactive programs for which MC/DC was motivated by. We show that our method, by a significant margin, surpasses CBMC. In particular, our method often produces an optimal MC/DC result.

Log-based slicing for system-level test cases

Salma Messaoudi
Donghwan Shin
Annibale Panichella
Domenico Bianculli
Lionel C. Briand

Regression testing is arguably one of the most important activities in software testing. However, its cost-effectiveness and usefulness can be largely impaired by complex system test cases that are poorly designed (e.g., test cases containing multiple test scenarios combined into a single test case) and that require a large amount of time and resources to run. One way to mitigate this issue is decomposing such system test cases into smaller, separate test cases---each of them with only one test scenario and with its corresponding assertions---so that the execution time of the decomposed test cases is lower than the original test cases, while the test effectiveness of the original test cases is preserved. This decomposition can be achieved with program slicing techniques, since test cases are software programs too. However, existing static and dynamic slicing techniques exhibit limitations when (1) the test cases use external resources, (2) code instrumentation is not a viable option, and (3) test execution is expensive.

In this paper, we propose a novel approach, called DS3 (Decomposing System teSt caSe), which automatically decomposes a complex system test case into separate test case slices. The idea is to use test case execution logs, obtained from past regression testing sessions, to identify "hidden" dependencies in the slices generated by static slicing. Since logs include run-time information about the system under test, we can use them to extract access and usage of global resources and refine the slices generated by static slicing.

We evaluated DS3 in terms of slicing effectiveness and compared it with a vanilla static slicing tool. We also compared the slices obtained by DS3 with the corresponding original system test cases, in terms of test efficiency and effectiveness. The evaluation results on one proprietary system and one open-source system show that DS3 is able to accurately identify the dependencies related to the usage of global resources, which vanilla static slicing misses. Moreover, the generated test case slices are, on average, 3.56 times faster than original system test cases and they exhibit no significant loss in terms of fault detection effectiveness.

Model-based testing of networked applications

Yishuai Li
Benjamin C. Pierce
Steve Zdancewic

We present a principled automatic testing framework for application-layer protocols. The key innovation is a domain-specific embedded language for writing nondeterministic models of the behavior of networked servers. These models are defined within the Coq interactive theorem prover, supporting a smooth transition from testing to formal verification.

Given a server model, we show how to automatically derive a tester that probes the server for unexpected behaviors. We address the uncertainties caused by both the server's internal choices and the network delaying messages nondeterministically. The derived tester accepts server implementations whose possible behaviors are a subset of those allowed by the nondeterministic model.

We demonstrate the effectiveness of this framework by using it to specify and test a fragment of the HTTP/1.1 protocol, showing that the automatically derived tester can capture RFC violations in buggy server implementations, including the latest versions of Apache and Nginx.

Validating static warnings via testing code fragments

Ashwin Kallingal Joshy
Xueyuan Chen
Benjamin Steenhoek
Wei Le

Static analysis is an important approach for finding bugs and vulnerabilities in software. However, inspecting and confirming static warnings are challenging and time-consuming. In this paper, we present a novel solution that automatically generates test cases based on static warnings to validate true and false positives. We designed a syntactic patching algorithm that can generate syntactically valid, semantic preserving executable code fragments from static warnings. We developed a build and testing system to automatically test code fragments using fuzzers, KLEE and Valgrind. We evaluated our techniques using 12 real-world C projects and 1955 warnings from two commercial static analysis tools. We successfully built 68.5% code fragments and generated 1003 test cases. Through automatic testing, we identified 48 true positives and 27 false positives, and 205 likely false positives. We matched 4 CVE and real-world bugs using Helium, and they are only triggered by our tool but not other baseline tools. We found that testing code fragments is scalable and useful; it can trigger bugs that testing entire programs or testing procedures failed to trigger.

Continuous test suite failure prediction

Cong Pan
Michael Pradel

Continuous integration advocates to run the test suite of a project frequently, e.g., for every code change committed to a shared repository. This process imposes a high computational cost and sometimes also a high human cost, e.g., when developers must wait for the test suite to pass before a change appears in the main branch of the shared repository. However, only 4% of all test suite invocations turn a previously passing test suite into a failing test suite. The question arises whether running the test suite for each code change is really necessary. This paper presents continuous test suite failure prediction, which reduces the cost of continuous integration by predicting whether a particular code change should trigger the test suite at all. The core of the approach is a machine learning model based on features of the code change, the test suite, and the development history. We also present a theoretical cost model that describes when continuous test suite failure prediction is worthwhile. Evaluating the idea with 15k test suite runs from 242 open-source projects shows that the approach is effective at predicting whether running the test suite is likely to reveal a test failure. Moreover, we find that our approach improves the AUC over baselines that use features proposed for just-in-time defect prediction and test case failure prediction by 13.9% and 2.9%, respectively. Overall, continuous test suite failure prediction can significantly reduce the cost of continuous integration.

Empirical evaluation of smart contract testing: what is the best choice?

Meng Ren
Zijing Yin
Fuchen Ma
Zhenyang Xu
Yu Jiang
Chengnian Sun
Huizhong Li
Yan Cai

Security of smart contracts has attracted increasing attention in recent years. Many researchers have devoted themselves to devising testing tools for vulnerability detection. Each published tool has demonstrated its effectiveness through a series of evaluations on their own experimental scenarios. However, the inconsistency of evaluation settings such as different data sets or performance metrics, may result in biased conclusion.

In this paper, based on an empirical evaluation of widely used smart contract testing tools, we propose a unified standard to eliminate the bias in the assessment process. First, we collect 46,186 source-available smart contracts from four influential organizations. This comprehensive dataset is open to the public and involves different code characteristics, vulnerability patterns and application scenarios. Then we propose a 4-step evaluation process and summarize the difference among relevant work in these steps. We use nine representative tools to carry out extensive experiments. The results demonstrate that different choices of experimental settings could significantly affect tool performance and lead to misleading or even opposite conclusions. Finally, we generalize some problems of existing testing tools, and propose some possible directions for further improvement.

SESSION: Binary Analysis

iDEV: exploring and exploiting semantic deviations in ARM instruction processing

Shisong Qin
Chao Zhang
Kaixiang Chen
Zheming Li

ARM has become the most competitive processor architecture. Many platforms or tools are developed to execute or analyze ARM instructions, including various commercial CPUs, emulators, and binary analysis tools. However, they have deviations when processing the same ARM instructions, and little attention has been paid to systematically analyze such semantic deviations, not to mention the security implications of such deviations. In this paper, we conduct an empirical study on the ARM Instruction Semantic Deviation (ISDev) issue. First, we classify this issue into several categories and analyze the security implications behind them. Then, we further demonstrate several novel attacks which utilize the ISDev issue, including stealthy targeted attacks and targeted defense evasion. Such attacks could exploit the semantic deviations to generate malware that is specific to certain platforms or able to detect and bypass certain detection solutions. We have developed a framework iDEV to systematically explore the ISDev issue in existing ARM instructions processing tools and platforms via differential testing. We have evaluated iDEV on four hardware devices, the QEMU emulator, and five disassemblers which could process the ARMv7-A instruction set. The evaluation results show that, over six million instructions could cause dynamic executors (i.e., CPUs and QEMU) to present different runtime behaviors, and over eight million instructions could cause static disassemblers yielding different decoding results, and over one million instructions cause inconsistency between dynamic executors and static disassemblers. After analyzing the root causes of each type of deviation, we point out they are mostly due to ARM unpredictable instructions and program defects.

RAProducer: efficiently diagnose and reproduce data race bugs for binaries via trace analysis

Ming Yuan
Yeseop Lee
Chao Zhang
Yun Li
Yan Cai
Bodong Zhao

A growing number of bugs have been reported by vulnerability discovery solutions. Among them, some bugs are hard to diagnose or reproduce, including data race bugs caused by thread interleavings. Few solutions are able to well address this issue, due to the huge space of interleavings to explore. What’s worse, in security analysis scenarios, analysts usually have no access to the source code of target programs and have troubles in comprehending them.

In this paper, we propose a general solution RAProducer to efficiently diagnose and reproduce data race bugs, for both user-land binary programs and kernels without source code. The efficiency of RAProducer is achieved by analyzing the execution trace of the given PoC (proof-of-concept) sample to recognize race- and bug-related elements (including locks and shared variables), which greatly facilitate narrowing down the huge search space of data race spots and thread interleavings. We have implemented a prototype of RAProducer and evaluated it on 7 kernel and 10 user-land data race bugs. Evaluation results showed that, RAProducer is effective at reproducing all these bugs. More importantly, it enables us to diagnose 2 extra real world bugs which are left unconfirmed for a long time. It is also efficient as it reduces candidate data race spots of each bug to a small set, and narrows down the thread interleaving greatly.RAProducer is also more effective in reproducing real-world data race bugs than other state-of-the-art solutions.

A lightweight framework for function name reassignment based on large-scale stripped binaries

Han Gao
Shaoyin Cheng
Yinxing Xue
Weiming Zhang

Software in the wild is usually released as stripped binaries that contain no debug information (e.g., function names). This paper studies the issue of reassigning descriptive names for functions to help facilitate reverse engineering. Since the essence of this issue is a data-driven prediction task, persuasive research should be based on sufficiently large-scale and diverse data. However, prior studies can only be based on small-scale datasets because their techniques suffer from heavyweight binary analysis, making them powerless in the face of big-size and large-scale binaries.

This paper presents the Neural Function Rename Engine (NFRE), a lightweight framework for function name reassignment that utilizes both sequential and structural information of assembly code. NFRE uses fine-grained and easily acquired features to model assembly code, making it more effective and efficient than existing techniques. In addition, we construct a large-scale dataset and present two data-preprocessing approaches to help improve its usability. Benefiting from the lightweight design, NFRE can be efficiently trained on the large-scale dataset, thereby having better generalization capability for unknown functions. The comparative experiments show that NFRE outperforms two existing techniques by a relative improvement of 32% and 16%, respectively, while the time cost for binary analysis is much less.

SESSION: Bugs and Analysis

Finding data compatibility bugs with JSON subschema checking

Andrew Habib
Avraham Shinnar
Martin Hirzel
Michael Pradel

JSON is a data format used pervasively in web APIs, cloud computing, NoSQL databases, and increasingly also machine learning. To ensure that JSON data is compatible with an application, one can define a JSON schema and use a validator to check data against the schema. However, because validation can happen only once concrete data occurs during an execution, it may detect data compatibility bugs too late or not at all. Examples include evolving the schema for a web API, which may unexpectedly break client applications, or accidentally running a machine learning pipeline on incorrect data. This paper presents a novel way of detecting a class of data compatibility bugs via JSON subschema checking. Subschema checks find bugs before concrete JSON data is available and across all possible data specified by a schema. For example, one can check if evolving a schema would break API clients or if two components of a machine learning pipeline have incompatible expectations about data. Deciding whether one JSON schema is a subschema of another is non-trivial because the JSON Schema specification language is rich. Our key insight to address this challenge is to first reduce the richness of schemas by canonicalizing and simplifying them, and to then reason about the subschema question on simpler schema fragments using type-specific checkers. We apply our subschema checker to thousands of real-world schemas from different domains. In all experiments, the approach is correct whenever it gives an answer (100% precision and correctness), which is the case for most schema pairs (93.5% recall), clearly outperforming the state-of-the-art tool. Moreover, the approach reveals 43 previously unknown bugs in popular software, most of which have already been fixed, showing that JSON subschema checking helps finding data compatibility bugs early.

Automated patch backporting in Linux (experience paper)

Ridwan Shariffdeen
Xiang Gao
Gregory J. Duck
Shin Hwei Tan
Julia Lawall
Abhik Roychoudhury

Whenever a bug or vulnerability is detected in the Linux kernel, the kernel developers will endeavour to fix it by introducing a patch into the mainline version of the Linux kernel source tree. However, many users run older “stable” versions of Linux, meaning that the patch should also be “backported” to one or more of these older kernel versions. This process is error-prone and there is usually along delay in publishing the backported patch. Based on an empirical study, we show that around 8% of all commits submitted to Linux mainline are backported to older versions,but often more than one month elapses before the backport is available. Hence, we propose a patch backporting technique that can automatically transfer patches from the mainline version of Linux into older stable versions. Our approach first synthesizes a partial transformation rule based on a Linux mainline patch. This rule can then be generalized by analysing the alignment between the mainline and target versions. The generalized rule is then applied to the target version to produce a backported patch. We have implemented our transformation technique in a tool called FixMorph and evaluated it on 350 Linux mainline patches. FixMorph correctly backports 75.1% of them. Compared to existing techniques, FixMorph improves both the precision and recall in backporting patches. Apart from automation of software maintenance tasks, patch backporting helps in reducing the exposure to known security vulnerabilities in stable versions of the Linux kernel.

Faster, deeper, easier: crowdsourcing diagnosis of microservice kernel failure from user space

Yicheng Pan
Meng Ma
Xinrui Jiang
Ping Wang

With the widespread use of cloud-native architecture, increasing web applications (apps) choose to build on microservices. Simultaneously, troubleshooting becomes full of challenges owing to the high dynamics and complexity of anomaly propagation. Existing diagnostic methods rely heavily on monitoring metrics collected from the kernel side of microservice systems. Without a comprehensive monitoring infrastructure, application owners and even cloud operators cannot resort to these kernel-space solutions. This paper summarizes several insights on operating a top commercial cloud platform. Then, for the first time, we put forward the idea of user-space diagnosis for microservice kernel failures. To this end, we develop a crowdsourcing solution - DyCause, to resolve the asymmetric diagnostic information problem. DyCause deploys on the application side in a distributed manner. Through lightweight API log sharing, apps collect the operational status of kernel services collaboratively and initiate diagnosis on demand. Deploying DyCause is fast and lightweight as we do not have any architectural and functional requirements for the kernel. To reveal more accurate correlations from asymmetric diagnostic information, we design a novel statistical algorithm that can efficiently discover the time-varying causalities between services. This algorithm also helps us build the temporal order of the anomaly propagation. Therefore, by using DyCause, we can obtain more in-depth and interpretable diagnostic clues with limited indicators. We apply and evaluate DyCause on both a simulated test-bed and a real-world cloud system. Experimental results verify that DyCause running in the user-space outperforms several state-of-the-art algorithms running in the kernel on accuracy. Besides, DyCause shows superior advantages in terms of algorithmic efficiency and data sensitivity. Simply put, DyCause produces a significantly better result than other baselines when analyzing much fewer or sparser metrics. To conclude, DyCause is faster to act, deeper in analysis, and easier to deploy.

SESSION: Demo Papers

echidna-parade: a tool for diverse multicore smart contract fuzzing

Alex Groce
Gustavo Grieco

Echidna is a widely used fuzzer for Ethereum Virtual Machine (EVM) compatible blockchain smart contracts that generates transaction sequences of calls to smart contracts. While Echidna is an essentially single-threaded tool, it is possible for multiple Echidna processes to communicate by use of a shared transaction sequence corpus. Echidna provides a very large variety of configuration options, since each smart contract may be best-tested by a non-default configuration, and different faults or coverage targets within a single contract may also have differing ideal configurations. This paper presents echidna-parade, a tool that provides pushbutton multicore fuzzing using Echidna as an underlying fuzzing engine, and automatically provides sophisticated diversification of configurations. Even without using multiple cores, echidna-parade can improve the effectiveness of fuzzing with Echidna, due to the advantages provided by multiple types of test configuration diversity. Using echidna-parade with multiple cores can produce significantly better results than Echidna, in less time.

ProFuzzBench: a benchmark for stateful protocol fuzzing

Roberto Natella
Van-Thuan Pham

We present a new benchmark (ProFuzzBench) for stateful fuzzing of network protocols. The benchmark includes a suite of representative open-source network servers for popular protocols, and tools to automate experimentation. We discuss challenges and potential directions for future research based on this benchmark.

SCStudio: a secure and efficient integrated development environment for smart contracts

Meng Ren
Fuchen Ma
Zijing Yin
Huizhong Li
Ying Fu
Ting Chen
Yu Jiang

With the increasing popularity of block-chain technologies, more and more engineers use smart contracts for application implementation. Traditional supporting tools can either provide code completions based on static libraries or detect a limited set of vulnerabilities, which results in the manpower waste during coding and miss-detection of bugs. In this work, we propose SCStudio, a unified smart contract development platform, which aims to help developers implement more secure smart contracts easily. The core idea is to realize real-time security-reinforced recommendation through pattern-based learning; and to perform security-oriented validation via integrated testing. SCStudio was implemented as a plug-in of VS Code. It has been used as the official development tool of WeBank and integrated as the recommended development tool by FISCO-BCOS community. In practice, it outperforms existing contract development environments, such as Remix, improving the average word suggestion accuracy by 30%-60% and helping detect about 25% more vulnerabilities.

The video is presented at https://youtu.be/l6hW3Ds5Tkg.

C4: the C compiler concurrency checker

Matt Windsor
Alastair F. Donaldson
John Wickerson

The correct compilation of atomic-action concurrency is vital now that multicore processors are ubiquitous. Despite much recent work on automated compiler testing, little existing tooling can test how real-world compilers handle compilation of atomic-action code. We demonstrate C4, a tool for exploring the concurrency behaviour of real-world C compilers such as GCC and LLVM. C4 automates a workflow based on generating, fuzzing, and executing litmus tests. So far, C4 has found two new control-flow bugs in GCC and IBM XL, and reproduced two historic concurrency bugs in GCC 4.

TauMed: test augmentation of deep learning in medical diagnosis

Yunhan Hou
Jiawei Liu
Daiwei Wang
Jiawei He
Chunrong Fang
Zhenyu Chen

Deep learning has made great progress in medical diagnosis. However, due to data standardization and privacy restriction, the acquisition and sharing of medical image data have been hindered, leading to the unacceptable accuracy of some intelligent medical diagnosis models. Another concern is data quality. If insufficient quantity and low-quality data are used for training and testing medical diagnosis models, it may cause serious medical accidents. We always use data augmentation to deal with it, and one of the most representative ways is through mutation relation. However, although common mutation methods can increase the amount of medical data, the quality of the image cannot be guaranteed due to the particularity of medical image. Therefore, combined with the characteristics of medical images, we propose TauMed, which implements augmentation techniques based on a series of mutation rules and domain semantics on medical datasets to generate sufficient and high-quality images. Moreover, we chose the ResNet-50 model to experiment with the augmented dataset and compared the results with two main popular mutation tools. The experimental result indicates that TauMed can improve the classification accuracy of the model effectively, and the quality of augmented images is higher than the other two tools. Its video is at https://www.youtube.com/watch?v=O8W8I7U_eqk and TauMed can be used at http://121.196.124.158:9500/.

MoScan: a model-based vulnerability scanner for web single sign-on services

Hanlin Wei
Behnaz Hassanshahi
Guangdong Bai
Padmanabhan Krishnan
Kostyantyn Vorobyov

Various third-party single sign-on (SSO) services (e.g., Facebook Login and Twitter Login) are widely deployed by web applications to facilitate their authentication and authorization processes. Nevertheless, integrating these services in a secure manner remains challenging, such that security issues are continually reported in recent years. In this work, we develop MoScan, a model-based scanner that can be used by software testers and security analysts for detecting and reporting security vulnerabilities in SSO implementations. MoScan takes as input a state machine built based on an SSO standard and our empirical study to represent participants' states and transitions during the login process. In the testing process, it analyzes network traces captured during the execution of SSO services, and increments the state machine which is then used to generate payloads to test the protocol participants. We evaluate MoScan with 23 real-world websites which integrate the Facebook SSO service to test its capability of identifying security vulnerabilities. To show the adaptability of MoScan's state machine, we also test it on Twitter and LinkedIn’s SSO services, and Github's authentication plugin in Jenkins. It detects three known weaknesses and one new logic fault from them, showing a new perspective in testing stateful protocol implementations like SSO services. Our demonstration and the source code of MoScan are available at https://github.com/baigd/moscan.

RESTest: automated black-box testing of RESTful web APIs

Alberto Martin-Lopez
Sergio Segura
Antonio Ruiz-Cortés

Testing RESTful APIs thoroughly is critical due to their key role in software integration. Existing tools for the automated generation of test cases in this domain have shown great promise, but their applicability is limited as they mostly rely on random inputs, i.e., fuzzing. In this paper, we present RESTest, an open source black-box testing framework for RESTful web APIs. Based on the API specification, RESTest supports the generation of test cases using different testing techniques such as fuzzing and constraint-based testing, among others. RESTest is developed as a framework and can be easily extended with new test case generators and test writers for different programming languages. We evaluate the tool in two scenarios: offline and online testing. In the former, we show how RESTest can efficiently generate realistic test cases (test inputs and test oracles) that uncover bugs in real-world APIs. In the latter, we show RESTest's capabilities as a continuous testing and monitoring framework. Demo video: https://youtu.be/1f_tjdkaCKo.