NL4SE 2018: Proceedings of the 4th ACM SIGSOFT International Workshop on NLP for Software Engineering

SESSION: Keynote

Learning from code with graphs (keynote)

Learning from large corpora of source code ("Big Code") has seen increasing interest over the past few years. A first wave of work has focused on leveraging off-the-shelf methods from other machine learning fields such as natural language processing. While these techniques have succeeded in showing the feasibility of learning from code, and led to some initial practical solutions, they forgo explicit use of known program semantics. In a range of recent work, we have tried to address this issue by integrating deep learning techniques with program analysis methods in graphs. Graphs are a convenient, general formalism to model entities and their relationships, and are seeing increasing interest from machine learning researchers as well. In this talk, I present two applications of graph-based learning to understanding and generating programs and discuss a range of future work building on the success of this work.

SESSION: Stack Overflow

LinkSO: a dataset for learning to retrieve similar question answer pairs on software development forums

We present LinkSO, a dataset for learning to rank similar questions on Stack Overflow. Stack Overflow contains a massive amount of crowd-sourced question links of high quality, which provides a great opportunity for evaluating retrieval algorithms for community-based question answer (cQA) archives and for learning to rank such archives. However, due to the existence of missing links, one question is whether question links can be readily used as the relevance judgment for evaluation. We study this question by measuring the closeness between question links and the relevance judgment, and we find that their agreement rates range from 80% to 88%. We conduct an empirical study on the performance of existing work on LinkSO. While existing work focuses on non-learning approaches, our study results reveal that learning-based approaches have great potential to further improve retrieval performance.

Two perspectives on software documentation quality in stack overflow

This paper studies software documentation quality in Stack Overflow from two perspectives: that of the questioners, who accept answers, and that of the community, which votes on answers. We show what developers can do to increase the chance that their questions or answers get accepted by the community or by the questioners. We found different expectations of what information, such as code or images, should be included in a question or an answer. We evaluated six different quality indicators (such as Flesch Reading Ease or images) which a developer should consider before posting a question or an answer. In addition, we found different quality indicators for different types of questions, in particular error, discrepancy, and how-to questions. Finally, we use a supervised machine-learning algorithm to predict when an answer will be accepted or voted for.
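
As an illustration of one indicator named in this abstract, the following is a minimal sketch of the standard Flesch Reading Ease formula, 206.835 - 1.015 (words / sentences) - 84.6 (syllables / words). The abstract does not say how the authors computed the score; the function names, the tokenization, and the vowel-group syllable counter below are assumptions made for this example only, and the syllable counter in particular is a crude heuristic.

```python
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease formula, computed from raw counts."""
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))


def count_syllables(word):
    """Crude heuristic (assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def score_text(text):
    """Score a non-empty English text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)


if __name__ == "__main__":
    # Hypothetical post text, not taken from the paper's data.
    print(round(score_text("The cat sat. The dog ran."), 2))
```

Higher scores indicate easier text. A production setup would likely use a dictionary-based syllable counter, since the vowel-run heuristic miscounts words with silent vowels.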

SESSION: Methodology

Total recall, language processing, and software engineering

A broad class of software engineering problems can be generalized as the "total recall problem". This short paper claims that identifying and exploring total recall problems in software engineering is an important task with wide applicability.

To make that case, we show that by applying and adapting state-of-the-art active learning and natural language processing algorithms for the total recall problem, two important software engineering tasks can also be addressed: (a) supporting large literature reviews and (b) identifying software security vulnerabilities. Furthermore, we conjecture that (c) test case prioritization and (d) static warning identification can also be framed as total recall problems and benefit from the same treatment.

The widespread applicability of "total recall" to software engineering suggests that there exists some underlying framework that encompasses not just natural language processing, but a wide range of important software engineering tasks.

A fine-grained approach for automated conversion of JUnit assertions to English

Converting source or unit test code to English has been shown to improve the maintainability, understandability, and analysis of software and tests. Code summarizers identify 'important' statements in the source/tests and convert them to easily understood English sentences using static analysis and NLP techniques. However, current test summarization approaches handle only a subset of the variation and customization allowed in the JUnit assert API (a critical component of test cases), which may affect the accuracy of conversions. In this paper, we present our work towards improving JUnit test summarization with a detailed process for converting a total of 45 unique JUnit assertions to English, including 37 previously unhandled variations of the assertThat method. This process has also been implemented and released as the AssertConvert tool. Initial evaluations have shown that this tool generates English conversions that accurately represent a wide variety of assertion statements, which could be used for code summarization or other NLP analyses.

SESSION: Code Analysis

Towards understanding code readability and its impact on design quality

Readability of code is commonly believed to impact the overall quality of software. Poor readability not only hinders developers from understanding what the code is doing but can also cause developers to make sub-optimal changes and introduce bugs. Developers also recognize this risk and state readability among their top information needs. Researchers have modeled readability scores. However, thus far, no one has investigated how readability evolves over time and how that impacts the design quality of software. We perform a large-scale study of 49 open source Java projects, spanning 8296 commits and 1766 files. We find that readability is high in open source projects and, unlike the design quality of a project, does not fluctuate over a project's lifetime. Also, readability has a non-significant correlation of 0.151 (Kendall's τ) with code smell count (an indicator of design quality). Since the current readability measure is unable to capture the increased difficulty in reading code due to degraded design quality, our results hint at the need for better measurement and modeling of code readability.
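
For readers unfamiliar with the statistic reported in this abstract, the following is a minimal sketch of Kendall's τ computed from pair counts. The abstract does not state which variant the authors used; this example assumes the tie-corrected τ-b, and the function name and the sample data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied in both variables: ignored
        elif dx == 0:
            ties_x += 1           # tied only in x
        elif dy == 0:
            ties_y += 1           # tied only in y
        elif dx * dy > 0:
            concordant += 1       # both variables move in the same direction
        else:
            discordant += 1       # the variables move in opposite directions
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    # Made-up per-file values, not data from the study.
    readability = [1, 2, 3, 4, 5]
    smell_count = [3, 1, 2, 5, 4]
    print(kendall_tau_b(readability, smell_count))
```

A value near 0, such as the 0.151 reported above, means the two rankings are only weakly related, while values near 1 or -1 indicate strong agreement or disagreement. The pairwise loop is quadratic in the number of observations, which is acceptable for a sketch but slow for large samples.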

Mining monitoring concerns implementation in Java-based software systems

In this paper, we describe a new approach for automatically identifying the implementation of monitoring concerns in Java-based software systems. We also present the results obtained by applying our approach to 21 Java-based systems, ranging from small to very large.

Generating comments from source code with CCGs

Good comments help developers understand software faster and provide better maintenance. However, comments are often missing, generally inaccurate, or out of date. Many of these problems can be avoided by automatic comment generation. This paper presents a method to generate informative comments directly from the source code using general-purpose techniques from natural language processing. We generate comments using an existing natural language model that couples words with their individual logical meaning and grammar rules, allowing comment generation to proceed by search from declarative descriptions of program text. We evaluate our algorithm on several classic algorithms implemented in Python.

SESSION: Applications

TestNMT: function-to-test neural machine translation

Test generation can have a large impact on the software engineering process by decreasing the amount of time and effort required to maintain a high level of test coverage. This increases the quality of the resultant software while decreasing the associated effort. In this paper, we present TestNMT, an experimental approach to test generation using neural machine translation. TestNMT aims to learn to translate from functions to tests, allowing a developer to generate an approximate test for a given function, which can then be adapted to produce the final desired test.

We also present a preliminary quantitative and qualitative evaluation of TestNMT in both cross-project and within-project scenarios. This evaluation shows that TestNMT is potentially useful in the within-project scenario, where it achieves a maximum BLEU score of 21.2, a maximum ROUGE-L score of 38.67, and is shown to be capable of generating approximate tests that are easy to adapt to working tests.

3CAP: categorizing the cognitive capabilities of Alzheimer’s patients in a smart home environment

Alzheimer’s disease is a progressive illness that affects more than 5.5 million people in the United States and has no effective cure or treatment. Symptoms of the disease include declines in memory and speech abilities and increases in aggression and insomnia. Recent research suggests that NLP techniques can detect early cognitive decline as well as monitor the rate of decline over time. The processed data can be used in a smart home environment to enhance the level of home care for Alzheimer’s patients. This paper proposes early-stage research in software engineering and natural language processing for quantifying and evaluating the patient’s cognitive state to determine the required level of support in a smart home.

Natural language processing (NLP) applied on issue trackers

In the domain of software engineering, NLP techniques are needed to find and use duplicate or similar development knowledge that is stored in development documentation as development tasks. To understand duplicate and similar development documentation, we discuss different NLP techniques such as descriptive statistics, topic analysis, and similarity algorithms such as N-grams, the Jaccard or LSI algorithm, as well as machine learning algorithms such as decision trees or support vector machines (SVM). These techniques are used to reach a better understanding of the characteristics, the lexical relations (syntactic and semantic), and the classification and prediction of duplicate development tasks. We found that duplicate tasks share conceptual information and tend to be created by inexperienced developers. By tuning different features to predict development tasks with a gradient or a Fidelity loss function, a system can identify duplicate tasks with 100% accuracy.
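
Two of the similarity techniques named in this abstract, N-grams and the Jaccard coefficient, can be combined as in the minimal sketch below. This is an illustration only, not the authors' pipeline: the function names, the whitespace tokenization, the choice of word bigrams, and the example task titles are all assumptions.

```python
def word_ngrams(text, n=2):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical issue titles, not taken from the paper's data.
    task_a = "app crashes on login page"
    task_b = "app crashes on startup"
    print(jaccard(word_ngrams(task_a), word_ngrams(task_b)))
```

A score close to 1 suggests that two task descriptions share most of their phrasing and may be duplicates. The threshold at which a pair is flagged would have to be tuned on labeled data.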
