ICPC '18: Proceedings of the 26th Conference on Program Comprehension

SESSION: Keynote

Mining the mind, minding the mine: grand challenges in comprehension and mining

The program comprehension and mining software repository communities are, in practice, two separate research endeavors. One is concerned with what's happening in a developer's mind, while the other is concerned with what's happening in a team. And yet, implicit in these fields is a common goal to make better software and the common approach of influencing developer decisions. In this keynote, I provide several examples of this overlap, suggesting several grand challenges in comprehension and mining.

SESSION: Vision keynote

Sensing and supporting software developers' focus

Software developers regularly have to focus in order to successfully perform their work. At the same time, developers experience many disruptions to their focus, especially in today's highly demanding, collaborative and open office work environments. When these disruptions happen during tasks that require a lot of focus, such as comprehending a difficult piece of source code, they can be very costly, causing a decrease in performance and quality. By sensing how focused a developer is, we might be able to reduce the cost of such disruptions.

In our previous work, we investigated the use of biometric and computer interaction sensors to sense interruptibility (the availability for interruptions) and developed the FlowLight approach (a traffic-light-like LED indicator of a person's interruptibility) to reduce the cost of external in-person interruptions, a particularly expensive kind of disruption. Our results demonstrate the potential of accurately sensing interruptibility in the field and of reducing external interruption cost to increase focus and productivity of knowledge workers.

Overcoming language dichotomies: toward effective program comprehension for mobile app development

Mobile devices and platforms have become an established target for modern software developers due to performant hardware and a large and growing user base numbering in the billions. Despite their popularity, the software development process for mobile apps comes with a set of unique, domain-specific challenges rooted in program comprehension. Many of these challenges stem from developer difficulties in reasoning about different representations of a program, a phenomenon we define as a "language dichotomy". In this paper, we reflect upon the various language dichotomies that contribute to open problems in program comprehension and development for mobile apps. Furthermore, to help guide the research community towards effective solutions for these problems, we provide a roadmap of directions for future work.

SESSION: Most influential paper award

Adventures in NICAD: a ten-year retrospective

Based on the simple, naive idea of text-line differencing of pretty-printed code, at ICPC 2008 we introduced NICAD [5], the first code clone detector explicitly aimed at finding intentional "near-miss" (Type 3) clones. Using the TXL [2] parser to identify and pretty-print all instances of a code unit of interest (functions, blocks, etc.), NICAD provides several ways to pre-process the code before comparison, including flexible formatting, renaming, normalization and abstraction, making it suitable for finding all kinds of clones in a wide range of different applications. In this talk we will outline the journey from that initial naive idea to an efficient, scalable, flexible clone detection tool that handles more than ten different languages with high accuracy in both precision and recall [8]. Along the way we will highlight our experience in tuning our initial prototype to production speed and scalability [4], we will review its application in a range of large-scale clone experiments [3, 6, 7], and describe its evolution to handle new domains such as subsystem clones in graphical models [1]. Finally, we will close with new methods based on NICAD [9, 10] and its lessons for clone detection research in the future.
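The core idea of text-line differencing over normalized code can be illustrated with a minimal sketch. This is not NICAD's implementation: the `normalize` function is a crude stand-in for TXL-based pretty-printing, the `difflib`-based line matching and the 0.3 threshold are assumptions chosen for illustration, and all function names are hypothetical.

```python
import difflib


def normalize(fragment: str) -> list[str]:
    """Crude stand-in for pretty-printing: collapse whitespace, drop blank lines."""
    return [" ".join(line.split()) for line in fragment.splitlines() if line.strip()]


def dissimilarity(a: str, b: str) -> float:
    """Fraction of lines left unmatched, relative to the longer fragment."""
    lines_a, lines_b = normalize(a), normalize(b)
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    longest = max(len(lines_a), len(lines_b))
    return 0.0 if longest == 0 else 1.0 - matched / longest


def is_near_miss(a: str, b: str, threshold: float = 0.3) -> bool:
    """Treat two fragments as near-miss clones if few enough lines differ."""
    return dissimilarity(a, b) <= threshold


original = "int sum(int a, int b) {\n    int r = a + b;\n    return r;\n}\n"
edited = "int sum(int a, int b) {\n  int r = a + b;\n  log(r);\n  return r;\n}\n"
print(is_near_miss(original, edited))
```

Here the two fragments differ only in formatting and one inserted line, so four of the five normalized lines match and the pair falls under the assumed threshold.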

SESSION: Technical research

Meaningful variable names for decompiled code: a machine translation approach

When code is compiled, information is lost, including some of the structure of the original source code as well as local identifier names. Existing decompilers can reconstruct much of the original source code, but typically use meaningless placeholder variables for identifier names. Using variable names which are more natural in the given context can make the code much easier to interpret, despite the fact that variable names have no effect on the execution of the program. In theory, it is impossible to recover the original identifier names since that information has been lost. However, most code is natural: it is highly repetitive and predictable based on the context. In this paper we propose a technique that assigns variables meaningful names by taking advantage of this naturalness property. We consider decompiler output to be a noisy distortion of the original source code, where the original source code is transformed into the decompiler output. Using this noisy channel model, we apply standard statistical machine translation approaches to choose natural identifiers, combining a translation model trained on a parallel corpus with a language model trained on unmodified C code. We generate a large parallel corpus from 1.2 TB of C source code obtained from GitHub. Under the most conservative assumptions, our technique is still able to recover the original variable names up to 16.2% of the time, which represents a lower bound for performance.

Descriptive compound identifier names improve source code comprehension

Reading and understanding source code is a major task in software development. Code comprehension depends on the quality of code, which is impacted by code structure and identifier naming. In this paper we empirically investigated whether longer but more descriptive identifier names improve code comprehension compared to short names, as they represent useful information in more detail. In a web-based study 88 Java developers were asked to locate a semantic defect in source code snippets. With descriptive identifier names, developers spent more time in the lines of code before the actual defect occurred and changed their reading direction less often, finding the semantic defect about 14% faster than with shorter but less descriptive identifier names. These effects disappeared when developers searched for a syntax error, i.e., when no in-depth understanding of the code was required. Interestingly, the style of identifier names had a clear impact on program comprehension for more experienced developers but not for less experienced developers.

Un-break my build: assisting developers with build repair hints

Continuous integration is an agile software development practice. Instead of being integrated right before a release, features are constantly integrated in an automated build process. This shortens the release cycle, improves software quality, and reduces time to market. However, the whole process comes to a halt when a commit breaks the build, which can happen for several reasons, e.g., compilation errors or test failures, and fixing the build suddenly becomes a top priority. Developers not only have to find the cause of the build break and fix it, but they also have to do so quickly to avoid delaying others. Unfortunately, these steps require deep knowledge and are often time consuming. To support developers in fixing a build break, we propose Bart, a tool that summarizes the reasons for the build failure and suggests possible solutions found on the Internet. We show in a case study with eight participants that developers find Bart useful for understanding build breaks and that using Bart substantially reduces the time to fix a build break, on average by 41%.

Aiding comprehension of unit test cases and test suites with stereotype-based tagging

Techniques to automatically identify the stereotypes of different software artifacts (e.g., classes, methods, commits) were previously presented. Those approaches utilized the techniques to support comprehension of software artifacts, but those stereotype-based approaches were not designed to consider the structure and purpose of unit tests, which are widely used in software development to increase the quality of source code. Moreover, unit tests are different than production code, since they are designed and written by following different principles and workflows.

In this paper, we present a novel approach, called TeStereo, for automated tagging of methods in unit tests. The tagging is based on an original catalog of stereotypes that we have designed to improve the comprehension and navigation of unit tests in a large test suite. The stereotype tags are automatically selected by using static control-flow, data-flow, and API call based analyses. To evaluate the benefits of the stereotypes and the tagging reports, we conducted a study with 46 students and another survey with 25 Apache developers to (i) validate the accuracy of the inferred stereotypes, (ii) measure the usefulness of the stereotypes when writing/understanding unit tests, and (iii) collect feedback on the usefulness of the generated tagging reports.

JIT feedback: what experienced developers like about static analysis

Although software developers are usually reluctant to use static analysis to detect issues in their source code, our automatic just-in-time static analysis assistant was integrated into an Integrated Development Environment, and was evaluated positively by its users. We conducted interviews to understand the impact of the tool on experienced developers, and how it performs in comparison with other static analyzers.

We learned that the availability of our tool as a default IDE feature and its automatic execution are the main reasons for its adoption. Moreover, the fact that immediate feedback is provided directly in the related development context is essential to keeping developers satisfied, although in certain cases feedback delivered later was deemed more useful. We also discovered that static analyzers can play an educational role, especially in combination with domain-specific rules.

How do design decisions affect the distribution of software metrics?

Background. Source code analysis techniques usually rely on metric-based assessment. However, most of these techniques have low accuracy. One possible reason is that metric thresholds are extracted from classes driven by distinct design decisions. Previous studies have already shown that classes implemented according to some coarse-grained design decisions, such as programming languages, have different distributions of metric values. Therefore, these design decisions should be taken into account when using benchmarks for metric-based source code analysis. Goal. Our goal is to investigate whether other, fine-grained design decisions also influence the distribution of software metrics. Method. We conduct an empirical study to evaluate the distributions of four metrics applied over fifteen real-world systems based on three different domains. Initially, we evaluated the influence of the class design role on the distributions of measures. For this purpose, we have defined an automatic approach to identify the design role played by each class. Then, we looked for other fine-grained design decisions that could have influenced the measures. Results. Our findings show that the distributions of metrics are sensitive to the following design decisions: (i) design role of the class, (ii) used libraries, (iii) coding style, (iv) exception handling, and (v) logging and debugging code mechanisms. Conclusion. The distributions of software metrics are sensitive to fine-grained design decisions, and we should consider taking them into account when building benchmarks for metric-based source code analysis.

Hierarchical abstraction of execution traces for program comprehension

Understanding the dynamic behavior of a software system is one of the most important and time-consuming tasks for today's software maintainers. In practice, understanding the inner workings of software requires studying the source code and documentation and inserting logging code in order to map high-level descriptions of the program behavior to the low-level implementation, i.e., the source code. Unfortunately, for large codebases and large log files, such cognitive mapping can be quite challenging. To bridge the cognitive gap between the source code and detailed models of program behavior, we propose a fully automatic approach to present a semantic abstraction with different levels of functional granularity from full execution traces. Our approach builds multi-level abstractions and identifies frequent behaviors at each level based on a number of execution traces, and then labels phases within individual execution traces according to the identified major functional behaviors of the system. To validate our approach, we conducted a case study on a large-scale subject program, Javac, to demonstrate the effectiveness of the mining result. Furthermore, the results of a user study demonstrate that our approach is capable of presenting users a high-level comprehensible abstraction of execution behavior. Based on a real-world subject program, the participants in our user study were able to achieve a mean accuracy of 70%.

Component interface identification and behavioral model discovery from software execution data

Restructuring an object-oriented software system into a component-based one allows for a better understanding of the system and facilitates its future maintenance. A component-based architecture structures a software system in terms of its components and interactions, where each component refers to a set of classes. To represent the architectural interaction, each component provides a set of interfaces. Existing interface identification approaches are mostly structure-oriented rather than function-oriented. In this paper, we propose an approach to identify interfaces of a component according to the functional interaction information that is recorded in the software execution data. In addition, we discover the contract (represented as a behavioral model) for each identified interface by using process mining techniques to help understand how each interface actually works. All proposed approaches have been implemented in the open source process mining toolkit ProM. Using a set of software execution data containing more than 650,000 method calls generated from three software systems, we evaluate our approach against three existing interface identification approaches. The empirical evaluation demonstrates that our approach can discover more functionally consistent interfaces, which facilitate the reconstruction of architectural models with higher quality.

Recognizing software bug-specific named entity in software bug repository

Software bugs are unavoidable in software development and maintenance. In order to manage bugs effectively, bug tracking systems are developed to help record, manage and track the bugs of each project. The rich information in the bug repository makes it possible to establish entity-centric knowledge bases to help understand and fix the bugs. However, existing named entity recognition (NER) systems deal with text that is structured, formal, well written, with a good grammatical structure and few spelling errors, and so cannot be directly used for bug-specific named entity recognition. Bug data, in contrast, are free-form texts that mix natural language with code, abbreviations and software-specific vocabularies. In this paper, we summarize the characteristics of bug entities, propose a classification method for bug entities, and build a baseline corpus on two open source projects (Mozilla and Eclipse). On this basis, we propose an approach for bug-specific entity recognition called BNER with the Conditional Random Fields (CRF) model and word embedding technique. An empirical study is conducted to evaluate the accuracy of our BNER technique, and the results show that the two designed baseline corpora are suitable for bug-specific named entity recognition, and our BNER approach is effective on cross-project NER.

Recommending frequently encountered bugs

Developers introduce bugs during software development that reduce software reliability. Many of these bugs are commonly occurring and have been experienced by many other developers. Informing developers, especially novice ones, about commonly occurring bugs in a domain of interest (e.g., Java) can help them comprehend programs and avoid similar bugs in the future. Unfortunately, information about commonly occurring bugs is not readily available. To address this need, we propose a novel approach named RFEB which recommends frequently encountered bugs (FEBugs) that may affect many other developers. RFEB analyzes Stack Overflow, which is the largest software engineering-specific Q&A community. Among the many questions posted on Stack Overflow, a large number provide descriptions and solutions of different kinds of bugs. Unfortunately, the search engine that comes with Stack Overflow is not able to identify FEBugs well. To address this limitation, RFEB uses an integrated and iterative approach that considers both relevance and popularity of Stack Overflow questions to identify FEBugs. To evaluate the performance of RFEB, we perform experiments on a dataset from Stack Overflow which contains more than ten million posts. We compared our model with Stack Overflow's search engine on 10 domains, and the experiment results show that RFEB achieves an average NDCG@10 score of 0.96, which improves on Stack Overflow's search engine by 20%.
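For readers unfamiliar with the reported metric, the following minimal sketch computes normalized discounted cumulative gain over the top 10 results. It assumes the common linear-gain variant and an ideal ordering taken from the same result list; the abstract does not state which variant the authors used, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1), ranks from 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the best possible reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return 0.0 if ideal == 0 else dcg_at_k(relevances, k) / ideal


# Graded relevance of the results in the order a search engine returned them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2]))
```

A score of 1.0 means the results are already in the best possible order; placing relevant items lower in the list reduces the score.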

Cross version defect prediction with representative data via sparse subset selection

Software defect prediction aims at detecting defect-prone software modules by mining historical development data from software repositories. Identifying such modules at an early stage of development can save large amounts of resources. Cross Version Defect Prediction (CVDP) is a practical scenario in which the classification model is trained on the historical data of the prior version and then used to predict the defect labels of modules of the current version. However, software development is a constantly evolving process, which leads to differences in data distribution across versions within the same project. These distribution differences degrade the performance of the classification model. In this paper, we approach this issue by leveraging a state-of-the-art Dissimilarity-based Sparse Subset Selection (DS3) method. This method selects a representative module subset from the prior version based on the pairwise dissimilarities between the modules of two versions and assigns each module of the current version to one of the representative modules. These selected modules can well represent the modules of the current version, thus mitigating the distribution differences. We evaluate the effectiveness of DS3 for CVDP on a total of 40 cross-version pairs from 56 versions of 15 projects with three traditional and two effort-aware indicators. The extensive experiments show that DS3 outperforms three baseline methods, especially in terms of the two effort-aware indicators.

Unsupervised deep bug report summarization

Bug report summarization is an effective way to reduce the considerable time spent wading through numerous bug reports. Although some supervised and unsupervised algorithms have been proposed for this task, their performance is still limited, due to the particular characteristics of bug reports, including the evaluation behaviours in bug reports, the diverse sentences in software language and natural language, and the domain-specific predefined fields. In this study, we conduct the first exploration of deep learning networks for bug report summarization. Our approach, called DeepSum, is a novel stepped auto-encoder network with evaluation enhancement and predefined fields enhancement modules, which successfully integrates the bug report characteristics into a deep neural network. DeepSum is unsupervised, which significantly reduces the effort of labeling huge training sets. Extensive experiments show that DeepSum outperforms the comparative algorithms by up to 13.2% and 9.2% in terms of F-score and Rouge-n metrics respectively over the public datasets, and achieves state-of-the-art performance. Our work shows promising prospects for deep learning to summarize millions of bug reports.

Analysis of test log information through interactive visualizations

A fundamental activity to achieve software quality is software testing, whose results are typically stored in log files. These files contain the richest and most detailed source of information for developers trying to understand failures and identify their potential causes. Analyzing and understanding the information presented in log files, however, can be a complex task, depending on the number of errors and the variety of information. Some previously proposed tools try to visualize test information, but they have limited interaction and present a single perspective of such data. This paper presents ASTRO, an infrastructure that extracts information from a number of log files and presents it in multi-perspective interactive visualizations that aim at easing and improving the developers' analysis process. A study carried out with practitioners from 3 software test factories indicated that ASTRO helps to analyze information of interest, with less accuracy in tasks that involved assimilation of information from different perspectives. Based on their difficulties, participants also provided feedback for improving the tool.

A search-based approach for accurate identification of log message formats

Many software engineering activities process the events contained in log files. However, before performing any processing activity, it is necessary to parse the entries in a log file, to retrieve the actual events recorded in the log. Each event is denoted by a log message, which is composed of a fixed part, called the (event) template, that is the same for all occurrences of the same event type, and a variable part, which may vary with each event occurrence. The formats of log messages, in complex and evolving systems, have numerous variations, are typically not entirely known, and change on a frequent basis; therefore, they need to be identified automatically.
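The split between the fixed and variable parts of a log message can be made concrete with a small sketch. It assumes whitespace tokenization, one `*` wildcard per variable token, and templates of fixed length; these are simplifications for illustration, and the names `WILDCARD`, `matches` and `variable_part` are hypothetical rather than taken from any tool described here.

```python
WILDCARD = "*"


def matches(template: list[str], message: str) -> bool:
    """A message matches if every fixed token is identical; wildcards accept any token."""
    tokens = message.split()
    return len(tokens) == len(template) and all(
        slot == WILDCARD or slot == token for slot, token in zip(template, tokens)
    )


def variable_part(template: list[str], message: str) -> list[str]:
    """Extract the tokens that fall on wildcard positions of a matching template."""
    return [token for slot, token in zip(template, message.split()) if slot == WILDCARD]


template = ["Connection", "from", "*", "closed", "after", "*", "ms"]
message = "Connection from 10.0.0.7 closed after 153 ms"
if matches(template, message):
    print(variable_part(template, message))
```

With a template in this form, all occurrences of the same event type share the fixed tokens, and only the values at the wildcard positions change between occurrences.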

The log message format identification problem deals with the identification of the different templates used in the messages of a log. Any solution to this problem has to generate templates that meet two main goals: generating templates that are not too general, so as to distinguish different events, but also not too specific, so as not to consider different occurrences of the same event as following different templates; however, these goals are conflicting.

In this paper, we present the MoLFI approach, which recasts the log message identification problem as a multi-objective problem. MoLFI uses an evolutionary approach to solve this problem, by tailoring the NSGA-II algorithm to search the space of solutions for a Pareto optimal set of message templates. We have implemented MoLFI in a tool, which we have evaluated on six real-world datasets, containing log files with a number of entries ranging from 2K to 300K. The experimental results show that MoLFI extracts by far the highest number of correct log message templates, significantly outperforming two state-of-the-art approaches on all datasets.
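What "Pareto optimal" means for two conflicting goals can be shown with a short sketch of a non-dominated filter. This is only the dominance check, not NSGA-II and not MoLFI itself; the two scores per candidate (labeled here as generality and specificity, both to be maximized) and the function names are hypothetical stand-ins for whatever objectives the tool actually optimizes.

```python
Score = tuple[float, float]  # (generality, specificity), both to be maximized


def dominates(a: Score, b: Score) -> bool:
    """a dominates b if it is at least as good on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: list[Score]) -> list[Score]:
    """Keep the candidates that no other candidate dominates."""
    return [s for s in scores if not any(dominates(other, s) for other in scores)]


candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5)]
print(pareto_front(candidates))
```

The candidate (0.5, 0.5) is dropped because (0.6, 0.6) is better on both objectives, while the remaining three each represent a different trade-off that cannot be improved on one objective without losing on the other.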

LogTracker: learning log revision behaviors proactively from software evolution history

Log statements are widely used for postmortem debugging. Despite the importance of log messages, it is difficult for developers to establish good logging practices. There are two main reasons for this. First, there are no rigorous specifications or systematic processes to guide the practices of software logging. Second, logging code co-evolves with bug fixes or feature updates. While previous works on log enhancement have successfully focused on the first problem, they struggle to address the second. As a first step towards solving the second problem, this paper is inspired by code clones and assumes that logging code with similar context is pervasive in software and deserves similar modifications. To verify our assumptions, we conduct an empirical study on eight open-source projects. Based on the observations, we design and implement LogTracker, an automatic tool that can predict log revisions by mining the correlation between logging context and modifications. With an enhanced modeling of logging context, LogTracker is able to guide more intricate log revisions that cannot be covered by existing tools. We evaluate the effectiveness of LogTracker by applying it to the latest version of the subject projects. The results of our experiments show that LogTracker can detect 199 instances of log revisions. So far, we have reported 25 of them, and 6 have been accepted.

Identifying software components from object-oriented APIs based on dynamic analysis

Reuse at the component level is generally more effective than reuse at the object-oriented class level. This is due to the granularity level: components expose their functionalities at an abstract level compared to fine-grained object-oriented classes. Moreover, components explicitly define their dependencies through their provided and required interfaces, which facilitates the understanding of how to reuse these components. Therefore, several component identification approaches have been proposed to identify components based on the analysis of object-oriented software applications. Nevertheless, most of the existing component identification approaches did not consider co-usage dependencies between API classes to identify classes/methods that can be reused to implement a specific scenario. In this paper, we propose an approach to identify reusable software components in object-oriented APIs, based on the interactions between client applications and the targeted API. As we are dealing with actual clients using the API, dynamic analysis allows us to better capture the instances of API usage. Approaches using static analysis are usually limited by the difficulty of handling dynamic features such as polymorphism and class loading. We evaluate our approach by applying it to three Java APIs with eight client applications from the DaCapo benchmark. DaCapo provides a set of pre-defined usage scenarios. The results show that our component identification approach has a very high precision.

Deep code comment generation

During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in software projects, so developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comment generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.

Automatically classifying posts into question categories on Stack Overflow

Software developers frequently solve development issues with the help of question and answer web forums, such as Stack Overflow (SO). While tags exist to support question searching and browsing, they are more related to technological aspects than to the question purposes. Tagging questions with their purpose can add a new dimension to the investigation of topics discussed in posts on SO. In this paper, we aim to automate such a classification of SO posts into seven question categories. As a first step, we have manually created a curated data set of 500 SO posts, classified into the seven categories. Using this data set, we apply machine learning algorithms (Random Forest and Support Vector Machines) to build a classification model for SO questions. We then experiment with 82 different configurations regarding the preprocessing of the text and representation of the input data. The results of the best performing models show that our models can classify posts into the correct question category with an average precision and recall of 0.88 and 0.87 when using Random Forest and the phrases indicating a question category as input data for the training. The obtained model can be used to aid developers in browsing SO discussions or researchers in building recommenders based on SO.

Automatic tag recommendation for software development video tutorials

Software development video tutorials are emerging as a new resource for developers to support their information needs. However, when trying to find the right video to watch for a task at hand, developers have little information at their disposal to quickly decide if they found the right video or not. This can lead to missing the best tutorials or wasting time watching irrelevant ones.

Other external sources of information for developers, such as StackOverflow, have benefited from the existence of informative tags, which help developers to quickly gauge the relevance of posts and find related ones. We argue that the same is valid also for videos and propose the first set of approaches to automatically generate tags describing the contents of software development video tutorials. We investigate seven tagging approaches for this purpose, some using information retrieval techniques and leveraging only the information in the videos, others relying on external sources of information, such as StackOverflow, as well as two out-of-the-box commercial video tagging approaches. We evaluated 19 different configurations of these tagging approaches and the results of a user study showed that some of the information retrieval-based approaches performed the best and were able to recommend tags that developers consider relevant for describing programming videos.

Classification of APIs by hierarchical clustering

APIs can be classified according to the programming domains (e.g., GUIs, databases, collections, or security) that they address. Such classification is vital in searching repositories (e.g., the Maven Central Repository for Java) and for understanding the technology stack used in software projects. We apply hierarchical clustering to a curated suite of Java APIs to compare the computed API clusters with preexisting API classifications. Clustering entails various parameters (e.g., the choice of IDF versus LSI versus LDA). We describe the corresponding variability in terms of a feature model. We exercise all possible configurations to determine the maximum correlation with respect to two baselines: i) a smaller suite of APIs manually classified in previous research; ii) a larger suite of APIs from the Maven Central Repository, thereby taking advantage of crowd-sourced classification while relying on a threshold-based approach for identifying important APIs and versions thereof, subject to an API dependency analysis on GitHub. We discuss the configurations found in this way and we examine the influence of particular features on the correlation between computed clusters and baselines. To this end, we also leverage interactive exploration of the parameter space and the resulting dendrograms. In this manner, we can also identify issues with the use of classifiers (e.g., missing classifiers) in the baselines and limitations of the clustering approach.

LESdroid: a tool for detecting exported service leaks of Android applications

Services are widely used in Android apps. However, services may leak such that they are no longer used but cannot be recycled by the Garbage Collector. Service leaks may cause an app to misbehave, and they are vulnerable to malicious external apps when the service is exported or is accessible through other exported services. In this paper, we present LESDroid, a tool for detecting exported service leaks. LESDroid automatically generates service instances and workloads (start/stop or bind/unbind of exported services) of the app under test, and applies a designated oracle to the heap snapshot for service leak detection. We evaluated LESDroid using 375 commercial apps, and found 97 leaked services and 98 distinct leak entries in 70 apps.

Do developers update third-party libraries in mobile apps?

One of the most common strategies to develop new software is to take advantage of existing source code, which is available in comprehensive packages called third-party libraries. Like all software systems, these libraries change to offer new functionalities and fix bugs or security issues. The way the changes are propagated has been studied by researchers interested in understanding their impact on the non-functional attributes of a system's source code. While the research community has mainly focused on the change propagation phenomenon in the context of traditional applications, little is known regarding the mobile context. In this paper, we aim to bridge this gap by conducting an empirical study on the evolution history of 291 mobile apps, investigating (i) whether mobile developers actually update third-party libraries, (ii) how categories of libraries differ with respect to the developers' proneness to update them, (iii) what common patterns developers follow when updating a software library, and (iv) whether high- and low-rated apps present peculiar update patterns. The results of the study showed that mobile developers rarely update their apps with respect to the used libraries, and when they do, they mainly tend to update the libraries related to the Graphical User Interface, with the aim of keeping the mobile apps updated with the latest design trends. In some cases developers ignore updates because of poor awareness of the benefits, or an excessively high cost/benefit ratio. Finally, high- and low-rated apps present strong differences.

What's inside my app?: understanding feature redundancy in mobile apps

As the number of mobile apps increases rapidly, many users may install dozens or even hundreds of apps on a single smartphone. However, many apps on the same phone may contain similar or even identical features, resulting in feature redundancy. For example, multiple apps may periodically check the weather forecast for the user. Feature redundancy may cause many undesirable side effects, such as consuming extra CPU resources and network traffic. This paper proposes a method to identify common features within an app and evaluates it on over four thousand popular apps. Experiments on a list of apps installed on actual smartphones show that the extent of feature redundancy is very high. We found that more than 85% of user smartphones contain redundant features, while in extreme cases, some smartphones may contain dozens of apps with the same feature. In addition, our user surveys found that about half of the redundant features are undesirable from the end users' perspective, which indicates that feature redundancy has become an important issue that needs to be investigated further.

Impacts of coding practices on readability

Several conventions and standards aim to improve maintainability of software code. However, low levels of code readability perceived by developers still represent a barrier to their daily work. In this paper, we describe a survey that assessed the impact of a set of Java coding practices on the readability perceived by software developers. While some practices promoted an enhancement of readability, others did not show statistically significant effects. Interestingly, one of the practices worsened the readability. Our results may help to identify coding conventions with a positive impact on readability and, thus, guide the creation of coding standards.

The effect of poor source code lexicon and readability on developers' cognitive load

It has been well documented that a large portion of the cost of any software lies in the time spent by developers in understanding a program's source code before any changes can be undertaken. One of the main contributors to software comprehension, by subsequent developers or by the authors themselves, is the quality of the lexicon (i.e., the identifiers and comments) that is used by developers to embed domain concepts and to communicate with their teammates. In fact, previous research shows that there is a positive correlation between the quality of identifiers and the quality of a software project. Results suggest that a poor-quality lexicon impairs program comprehension and consequently increases the effort that developers must spend to maintain the software. However, we do not yet have any empirical evidence of the relationship between the quality of the lexicon and the cognitive load that developers experience when trying to understand a piece of software. Given the associated costs, there is a critical need to empirically characterize the impact of the quality of the lexicon on developers' ability to comprehend a program.

In this study, we explore the effect of poor source code lexicon and readability on developers' cognitive load as measured by a cutting-edge and minimally invasive functional brain imaging technique called functional Near Infrared Spectroscopy (fNIRS). Additionally, while developers perform software comprehension tasks, we map cognitive load data to source code identifiers using an eye tracking device. Our results show that the presence of linguistic antipatterns in source code significantly increases the developers' cognitive load.

Assessing an architecture's ability to support feature evolution

Enabling rapid feature delivery is essential for product success and is therefore a goal of software architecture design. But how can we determine if and to what extent an architecture is "good enough" to support feature addition and evolution, or determine if a refactoring effort is successful in that features can be added more easily? In this paper, we contribute a concept called the Feature Space, and a formal definition of Feature Dependency, derived from a software project's revision history. We capture the dependency relations among the features of a system in a feature dependency structure matrix (FDSM), using features as first-class design elements. We also propose a Feature Decoupling Level (FDL) metric that can be used to measure the level of independence among features. Our investigation of 17 open source projects shows that files within each feature space are much more likely to be changed together, hence each feature space forms a meaningful maintainable unit that should be treated separately. The data also show that the history-based FDL is highly correlated with a structure-based maintainability metric: Decoupling Level (DL). When we examine a project's evolution history, we see that if a system is well-modularized, it is more likely that features can be added independently. For shorter periods of time, however, FDL and DL may not be consistent, e.g., when the addition of new features deviates from the designed architecture or does not involve parts of the system that have architecture flaws. In such cases, FDL and FDSM can be used to monitor potential architecture degradation caused by improper feature addition.

SESSION: Early research achievement

Code phonology: an exploration into the vocalization of code

When children learn to read, they almost invariably start with oral reading: reading the words and sentences out loud. Experiments have shown that when novices read text aloud, their comprehension is better than when reading in silence. This is attributed to the fact that reading aloud focuses the child's attention on the text. We hypothesize that reading code aloud could support program comprehension in a similar way, encouraging novice programmers to pay attention to details. To this end we explore how novices read code, and we found that novice programmers vocalize code in different ways, sometimes changing vocalization within a code snippet. We thus believe that in order to teach novices to read code aloud, an agreed-upon way of reading code is needed. As such, this paper proposes studying code phonology, ultimately leading to a shared understanding about how code should be read aloud, such that this can be practiced. In addition to being valuable as an educational and diagnostic tool for novices, we believe that pair programmers could also benefit from standardized communication about code, and that it could support improved tools for visually and physically disabled programmers.

Towards just-in-time refactoring recommenders

Empirical studies have provided ample evidence that low code quality is generally associated with lower maintainability. For this reason, tools have been developed to automatically detect design flaws (e.g., code smells). However, these tools are not able to prevent the introduction of design flaws. This means that the code has to experience a quality decay (with a consequent increase of maintenance/evolution costs) before state-of-the-art tools can be applied to identify and refactor the design flaws.

Our goal is to develop a new generation of refactoring recommenders aimed at preventing, via refactoring operations, the introduction of design flaws rather than fixing them once they already affect the system. We refer to such a novel perspective on software refactoring as just-in-time refactoring. In this paper, we make a first step towards this direction, presenting an approach able to predict which classes will be affected in the future by code smells.

Toward refactoring evaluation with code naturalness

Refactoring evaluation is a challenging research topic because whether a refactoring is right or wrong depends on various aspects of the development context, such as developers' skills, development cost, deadlines and so on. Many techniques have been proposed to evaluate refactoring objectively. However, those techniques do not consider the individual contexts of software development. Currently, the authors are trying to evaluate refactoring automatically and objectively while taking development contexts into account. In this paper, we propose to evaluate refactoring with code naturalness. Our technique is based on a hypothesis: if a given refactoring raises the naturalness of existing code, the refactoring is beneficial. In this paper, we also report our pilot study on open source software.
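Code naturalness is commonly quantified as the cross-entropy of a token sequence under a statistical language model trained on a code corpus, with lower cross-entropy meaning more natural code. The following minimal sketch illustrates that idea under stated assumptions: the bigram model, the add-one smoothing, the function names, and the toy corpus are all illustrative choices and are not taken from the paper.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count bigrams over a corpus given as a list of token lists."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in corpus:
        seq = ["<s>"] + tokens  # "<s>" marks the start of a sequence
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def cross_entropy(tokens, model):
    """Average negative log2 probability per token (tokens must be non-empty)."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for tokens never seen in training
    seq = ["<s>"] + tokens
    log_prob = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # Add-one smoothing so unseen bigrams keep a non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log2(p)
    return -log_prob / len(tokens)

# Hypothetical toy corpus of already tokenized code.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "j", "in", "range", "(", "m", ")", ":"],
]
model = train_bigram(corpus)
familiar = cross_entropy(["for", "i", "in", "range", "(", "n", ")", ":"], model)
unusual = cross_entropy([":", ")", "n", "(", "range", "in", "i", "for"], model)
print(familiar < unusual)
```

Under the hypothesis stated above, a refactoring would be judged beneficial when the cross-entropy of the code after the change is lower than before it, measured against a model trained on the same project.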

Replicomment: identifying clones in code comments

Code comments are the primary means to document implementation and ease program comprehension. Thus, their quality should be a primary concern to improve program maintenance. While a lot of effort has been dedicated to detecting bad smells in code, little work focuses on comments. In this paper we start working in this direction by detecting clones in comments. Our initial investigation shows that even well-known projects have several comment clones, and just as clones are a bad smell in code, they may be one in comments. A manual analysis of the clones we identified revealed several issues in real Java projects.

A preliminary study on using code smells to improve bug localization

Bug localization is a technique that has been proposed to support the process of identifying the locations of bugs specified in a bug report. A traditional approach such as information retrieval (IR)-based bug localization calculates the similarity between the bug description and the source code and suggests locations that are likely to contain the bug. However, while many approaches have been proposed to improve the accuracy, the likelihood of each module having a bug is often overlooked, or all modules are treated as equally likely to contain bugs, which may not be the case. For example, modules having code smells have been found to be more prone to changes and faults. Therefore, in this paper, we explore a first step toward leveraging code smells to improve bug localization. By combining the code smell severity with the textual similarity from IR-based bug localization, we can identify the modules that are not only similar to the bug description but also have a higher likelihood of containing bugs. Our preliminary evaluation on four open source projects shows that our technique can improve the baseline approach by 142.25% and 30.50% on average for method and class levels, respectively.
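One simple way to combine the two signals is a weighted sum of the textual similarity and the smell severity, each scaled to [0, 1]. The sketch below illustrates this under stated assumptions: the linear form, the weight alpha, the function names, and the sample scores are hypothetical, since the abstract does not give the actual formula.

```python
def combined_score(similarity, severity, alpha=0.5):
    """Blend IR similarity and smell severity; both are assumed to lie in [0, 1]."""
    return (1 - alpha) * similarity + alpha * severity

def rank_modules(modules, alpha=0.5):
    """Return module names ordered from most to least suspicious."""
    return sorted(
        modules,
        key=lambda name: combined_score(*modules[name], alpha=alpha),
        reverse=True,
    )

# Hypothetical scores: module -> (textual similarity, smell severity).
modules = {
    "Parser": (0.60, 0.10),
    "Cache": (0.55, 0.90),
    "Logger": (0.20, 0.30),
}
baseline = rank_modules(modules, alpha=0.0)   # similarity only
with_smells = rank_modules(modules, alpha=0.3)
print(baseline, with_smells)
```

With alpha set to 0 the ranking reduces to the IR-only baseline, which makes it easy to compare the two rankings on the same bug report.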

What design topics do developers discuss?

When contributing code to a software system, developers are often confronted with the hard task of understanding and adhering to the system's design. This task is often made more difficult by the lack of explicit design information. Often, recorded design information occurs only embedded in discussions between developers. If this design information could be identified automatically and put into a form useful to developers, many development tasks could be eased, such as directing questions that arise during code review, tracking design changes that might affect desired system qualities, and helping developers understand why the code is as it is. In this paper, we take an initial step towards this goal, considering how design information appears in pull request discussions and manually categorizing 275 paragraphs from those discussions that contain design information to learn about what kinds of design topics are discussed.

Toward introducing automated program repair techniques to industrial software development

Automated program repair (in short, APR) has been attracting much attention. A variety of APR techniques have been proposed, and they have been evaluated with actual bugs in open source software. Currently, the authors are trying to introduce APR techniques to industrial software development (in short, ISD) to reduce development cost drastically. However, at this moment, there are no studies that report evaluations of APR techniques on ISD. In this paper, we report our ongoing application of APR techniques to ISD and discuss some barriers that we encountered in doing so.

Learning lexical features of programming languages from imagery using convolutional neural networks

We demonstrate the ability of deep architectures, specifically convolutional neural networks, to learn and differentiate the lexical features of different programming languages presented in coding video tutorials found on the Internet. We analyze over 17,000 video frames containing examples of Java, Python, and other textual and non-textual objects. Our results indicate that not only can computer vision models based on deep architectures be taught to differentiate among programming languages with over 98% accuracy, but can learn language-specific lexical features in the process. This provides a powerful mechanism for carrying out program comprehension research on repositories where source code is represented with imagery rather than text, while simultaneously avoiding the computational overhead of optical character recognition.

On the naturalness of auto-generated code: can we identify auto-generated code automatically?

Recently, a variety of studies have been conducted on source code analysis. If auto-generated code is included in the target source code, it is usually removed in a preprocessing phase because the presence of auto-generated code may have negative effects on source code analysis. A straightforward way to remove auto-generated code is to search for the special comments that are included in the files of auto-generated code. However, auto-generated code can no longer be identified in this way if such special comments have been removed for some reason, and inspecting source files manually one by one takes too much effort. In this paper, we propose a new technique to identify auto-generated code by using the naturalness of auto-generated code. We used a golden set that includes thousands of hand-made source files and source files generated by four kinds of compiler-compilers. Through the evaluation with the dataset, we confirmed that our technique was able to identify auto-generated code with over 99% precision and recall for all the cases.

Augmenting source code lines with sample variable values

Source code is inherently abstract, which makes it difficult to understand. Activities such as debugging can reveal concrete runtime details, including the values of variables. However, they require that a developer explicitly requests these data for a specific execution moment. We present a simple approach, RuntimeSamp, which collects sample variable values during normal executions of a program by a programmer. These values are then displayed in an ambient way at the end of each line in the source code editor. We discuss questions which should be answered for this approach to be usable in practice, such as how to efficiently record the values and when to display them. We provide partial answers to these questions and suggest future research directions.
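The recording step can be sketched with Python's standard tracing hook, sys.settrace. This is only an illustration of the general idea and not the RuntimeSamp implementation, whose language and recording mechanism are not described here; the helper name sample_values and the demo function are hypothetical. The sketch keeps, for each source line, the local variable values observed right after that line last finished executing.

```python
import sys

def sample_values(func, *args):
    """Run func(*args) and return (result, {line number: locals after that line})."""
    samples = {}
    code = func.__code__
    last_line = None

    def tracer(frame, event, arg):
        nonlocal last_line
        if frame.f_code is not code:
            return None  # do not trace functions other than func
        # A "line" event fires before a line runs, so the locals seen now
        # are the state left behind by the previously executed line.
        if event in ("line", "return") and last_line is not None:
            samples[last_line] = dict(frame.f_locals)
        if event == "line":
            last_line = frame.f_lineno
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore any tracer that was active before
    return result, samples

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, samples = sample_values(demo, 3)
first = demo.__code__.co_firstlineno
for line in sorted(samples):
    print(line - first, samples[line])
```

An editor plugin could then render each recorded dictionary at the end of the corresponding line. Tracing every line is slow, which is one reason the question of how to record values efficiently matters in practice.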

An empirical investigation on the readability of manual and generated test cases

Software testing is one of the most crucial tasks in the typical development process. Developers are usually required to write unit test cases for the code they implement. Since this is a time-consuming task, in recent years many approaches and tools for automatic test case generation, such as EvoSuite, have been introduced. Nevertheless, developers have to maintain and evolve tests to keep up with changes in the source code; therefore, having readable test cases is important to ease this process. However, it is still not clear whether developers make an effort to write readable unit tests. Therefore, in this paper, we conduct an exploratory study comparing the readability of manually written test cases with that of the classes they test. Moreover, we deepen this analysis by looking at the readability of automatically generated test cases. Our results suggest that developers tend to neglect the readability of test cases and that automatically generated test cases are generally even less readable than manually written ones.

SESSION: Industry

How slim will my system be?: estimating refactored code size by merging clones

We have long conducted code clone analysis with industry collaborators, and have repeatedly been asked the question, "OK, I understand my system contains a lot of code clones, but how slim will it be after merging redundant code clones?" As a software system evolves over a long period, it accumulates code clones due to quick bug fixes and new feature additions. Industry collaborators recognize the decay of the initial design simplicity and try to evaluate the current system from the viewpoint of maintenance effort and cost. As one resource for this evaluation, the estimated code size after merging code clones is very important to them. In this paper, we formulate this issue as the "slimming" problem and present three slimming methods, the Basic, Complete, and Heuristic Methods, which give lower-bound, upper-bound, and modest reduction rates, respectively. Applying these methods to OSS systems written in C/C++ showed that the reduction rate is at most 5.7% of the total size, while for a commercial COBOL system it is at most 15.4%. We have received initial but very positive feedback on this approach from industry collaborators.
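To make the notion of a reduction rate concrete, the following back-of-the-envelope sketch estimates how many lines would disappear if every clone set were merged into a single shared routine. It is a deliberately simplified model and is not any of the three methods named above, whose definitions are not given here; the function name, the per-call overhead, and the sample numbers are assumptions.

```python
def estimated_reduction_rate(total_loc, clone_sets, call_overhead=1):
    """Estimate the fraction of lines saved by merging each clone set.

    clone_sets holds (instances, length) pairs: how many copies exist and how
    many lines each copy has. Merging keeps one copy and replaces every
    instance with a call costing call_overhead lines (an assumed constant).
    """
    saved = 0
    for instances, length in clone_sets:
        gain = (instances - 1) * length - instances * call_overhead
        saved += max(gain, 0)  # skip merges that would not shrink the code
    return saved / total_loc

# Hypothetical system: 10,000 lines with three clone sets.
rate = estimated_reduction_rate(10_000, [(3, 40), (2, 10), (5, 2)])
print(f"{rate:.2%}")
```

Clone sets that overlap or that differ in small ways would need more careful treatment, which is presumably where bounds such as those mentioned in the abstract come in.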

Codecompass: an open software comprehension framework for industrial usage

CodeCompass is an open source LLVM/Clang-based tool developed by Ericsson Ltd. and Eötvös Loránd University, Budapest to help the understanding of large legacy software systems. Based on the LLVM/Clang compiler infrastructure, CodeCompass gives exact information on complex C/C++ language elements like overloading, inheritance, the usage of variables and types, possible uses of function pointers and virtual functions - features that various existing tools support only partially. Steensgaard's and Andersen's pointer analysis algorithms are used to compute and visualize the use of pointers/references. The wide range of interactive visualizations extends further than the usual class and function call diagrams; architectural, component and interface diagrams are a few of the implemented graphs. To make comprehension more extensive, CodeCompass also utilizes build information to explore the system architecture as well as version control information.

CodeCompass is regularly used by hundreds of designers and developers. Having a web-based, pluginable, extensible architecture, the CodeCompass framework can be an open platform to further code comprehension, static analysis and software metrics efforts. The source code and a tutorial are publicly available on GitHub, and a live demo is also available online.

Leveraging the agile development process for selecting invoking/excluding tests to support feature location

A practical approach to feature location using agile unit tests is presented. The approach employs a modified software reconnaissance method for feature location, but in the context of an agile development methodology. Whereas a major drawback to software reconnaissance is the identification or development of invoking and excluding tests, the approach allows for the automatic identification of invoking and excluding tests by partially ordering existing agile unit tests via iteration information from the agile development process. The approach is validated in a comparison study with industry professionals, where the approach is shown to improve feature location speed, accuracy, and developer confidence over purely manual feature location.
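The core of software reconnaissance is a set difference: code executed by tests that invoke a feature, minus code executed by tests that do not. The sketch below illustrates this together with one plausible reading of the iteration-based test selection. The rule used to split the tests (tests added in the feature's iteration are invoking, tests from earlier iterations are excluding), the function names, and the coverage data are assumptions for illustration, not the authors' exact procedure.

```python
def split_tests(test_iteration, feature_iteration):
    """Partition tests by the iteration in which they were written (assumed rule)."""
    invoking = {t for t, it in test_iteration.items() if it == feature_iteration}
    excluding = {t for t, it in test_iteration.items() if it < feature_iteration}
    return invoking, excluding

def locate_feature(coverage, invoking, excluding):
    """Return code elements run by some invoking test and by no excluding test."""
    invoked = set().union(*(coverage[t] for t in invoking))
    excluded = set().union(*(coverage[t] for t in excluding))
    return invoked - excluded

# Hypothetical data: iteration in which each test was added, and its coverage.
test_iteration = {"test_login": 1, "test_browse": 2, "test_checkout": 3}
coverage = {
    "test_login": {"auth.check", "db.open"},
    "test_browse": {"db.open", "catalog.list"},
    "test_checkout": {"db.open", "cart.total", "pay.charge"},
}
invoking, excluding = split_tests(test_iteration, feature_iteration=3)
candidates = locate_feature(coverage, invoking, excluding)
print(sorted(candidates))
```

Shared infrastructure such as db.open drops out because earlier tests already exercise it, leaving the elements most specific to the feature.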

SESSION: Tool demonstration

SDexplorer: a generic toolkit for smoothly exploring massive-scale sequence diagram

Using a reverse-engineered sequence diagram is a valuable technique for understanding a program's behavior. In practice, researchers usually record execution traces and generate a sequence diagram from them. However, for real-world software the diagram can be too large to read because of the sheer size of the execution traces.

Several techniques for minimizing or compressing sequence diagrams have been proposed; however, the resulting diagrams may either remain large or lose important information. Besides, existing tools are highly customized for a certain research purpose. To address these problems, we present a generic toolkit, SDExplorer, in this paper, which is a flexible and lightweight tool to effectively explore a massive-scale sequence diagram in a highly scalable manner. Additionally, SDExplorer supports popular features of existing tools (e.g., search, filter, and grouping). We believe it is an easy-to-use and promising tool for future research to evaluate and compare minimizing/compressing techniques in real maintenance tasks.

SDExplorer is available at https://lyukx.github.io/SDExplorer/.

CoBOT: static C/C++ bug detection in the presence of incomplete code

To obtain precise and sound results, most existing static analyzers require whole-program analysis with complete source code. However, in reality, the source code of an application always interacts with many third-party libraries, which are often not easily accessible to static analyzers. Worse still, more than 30% of legacy projects [1] cannot be compiled easily due to complicated configuration environments (e.g., third-party libraries, compiler options and macros), making ideal "whole-program analysis" unavailable in practice. This paper presents CoBOT [2], a static analysis tool that can detect bugs in the presence of incomplete code. It analyzes function APIs unavailable in application code by either using function summarization or automatically downloading and analyzing the corresponding library code as inferred from the application code and its configuration files. The experiments show that CoBOT is not only easy to use, but also effective in detecting bugs in real-world programs with incomplete code. Our demonstration video is at: https://youtu.be/bhjJp3e7LPM.

MetropolJS: visualizing and debugging large-scale javascript program structure with treemaps

As a result of the large scale and diverse composition of modern compiled JavaScript applications, comprehending overall program structure for debugging is challenging. In this paper we present our solution: MetropolJS. By using a Treemap-based visualization it is possible to get a high level view within limited screen real estate. Previous approaches to Treemaps lacked the fine detail and interactive features to be useful as a debugging tool. This paper introduces an optimized approach for visualizing complex program structure that enables new debugging techniques where the execution of programs can be displayed in real time from a bird's-eye view. The approach facilitates highlighting and visualizing method calls and distinctive code patterns on top of code segments without a high overhead for navigation. Using this approach enables fast analysis of previously difficult-to-comprehend code bases.

The codecompass comprehension framework

CodeCompass is an open source LLVM/Clang-based tool developed by Ericsson Ltd. and the Eötvös Loránd University, Budapest to help the understanding of large legacy software systems. Based on the LLVM/Clang compiler infrastructure, CodeCompass gives exact information on complex C/C++ language elements like overloading, inheritance, the usage of variables and types, possible uses of function pointers and virtual functions - features that various existing tools support only partially. Steensgaard's and Andersen's pointer analysis algorithms are used to compute and visualize the use of pointers/references. The wide range of interactive visualizations extends further than the usual class and function call diagrams; architectural, component and interface diagrams are a few of the implemented graphs. To make comprehension more extensive, CodeCompass is not restricted to the source code. It also utilizes build information to explore the system architecture as well as version control information, e.g., git commit history and blame view. Clang-based static analysis results are also integrated into CodeCompass. Although the tool focuses mainly on C and C++, it also supports the Java and Python languages.

In this demo we will simulate a typical bug-fixing workflow in a C++ system. First, we show how to use the combined text- and definition-based search for fast feature location. Here we also demonstrate our log search, which can be used to locate the code that emitted a given message. Once we have an approximate location of the issue, we can start a detailed investigation to understand the class relationships, function call chains (including virtual calls and calls via function pointers), and the read/write events on individual variables. We also visualize the pointer relationships. To make the comprehension complete, we check the version control information to see who committed the code, when, and why.

This Tool demo submission complements our Industry track submission with a similar title. A live demo is also available at the homepage of the tool: https://github.com/ericsson/codecompass.