SWAN 2018: Proceedings of the 4th ACM SIGSOFT International Workshop on Software Analytics

(No) influence of continuous integration on the commit activity in GitHub projects

A core goal of Continuous Integration (CI) is to make small incremental changes to software projects, which are integrated frequently into a mainline repository or branch. This paper presents an empirical study that investigates whether developers adjust their commit activity towards this goal after projects start using CI. We analyzed the commit and merge activity in 93 GitHub projects that introduced the hosted CI system Travis CI, but had been developed for at least one year before introducing CI. In our analysis, we found only one non-negligible effect, an increased merge ratio, meaning that there were more merging commits in relation to all commits after the projects started using Travis CI. This effect has also been reported in related work. However, we observed the same effect in a random sample of 60 GitHub projects not using CI. Thus, it is unlikely that the effect is caused by the introduction of CI alone. We conclude that: (1) in our sample of projects, the introduction of CI did not lead to major changes in developers' commit activity, and (2) it is important to compare the commit activity to a baseline before attributing an effect to a treatment that may not be the cause of the observed effect.
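The merge ratio described in this abstract can be illustrated with a minimal sketch. It assumes that each commit is represented by its date and its number of parents, and that a merge commit is one with two or more parents. The `Commit` type and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commit:
    day: date
    parents: int  # number of parent commits; 2 or more marks a merge commit


def merge_ratio(commits):
    """Share of merge commits among all commits (0.0 for an empty list)."""
    if not commits:
        return 0.0
    merges = sum(1 for c in commits if c.parents >= 2)
    return merges / len(commits)


def ratios_around(commits, ci_start):
    """Merge ratio before and after a given CI introduction date."""
    before = [c for c in commits if c.day < ci_start]
    after = [c for c in commits if c.day >= ci_start]
    return merge_ratio(before), merge_ratio(after)
```

Applying the same split to projects that never adopted CI, using a comparable cut-off date, would give the kind of baseline the abstract argues for before attributing a change to CI.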

Characterizing the influence of continuous integration: empirical results from 250+ open source and proprietary projects

Continuous integration (CI) tools integrate code changes by automatically compiling, building, and executing test cases upon submission of code changes. Use of CI tools is becoming increasingly popular, yet how proprietary projects reap the benefits of CI remains unknown. To investigate the influence of CI on software development, we analyze 150 open source software (OSS) projects and 123 proprietary projects. For OSS projects, we observe the expected benefits after CI adoption, e.g., improvements in bug and issue resolution. However, for the proprietary projects, we cannot make similar observations. Our findings indicate that adoption of CI alone might not be enough to improve the software development process. CI can be effective for software development if practitioners use CI's feedback mechanism efficiently, by applying the practice of making frequent commits. For our set of proprietary projects, we observe that practitioners commit less frequently, and hence do not use CI effectively for obtaining feedback on the submitted code changes. Based on our findings, we recommend that industry practitioners adopt the best practices of CI, for example making frequent commits, to reap the benefits of CI tools.

Facilitating feasibility analysis: the pilot defects prediction dataset maker

Our industrial experience in institutionalizing defect prediction models in the software industry shows that the first step is to measure prediction metrics and defects to assess the feasibility of the tool, i.e., whether the accuracy of the defect prediction tool is higher than that of a random predictor. However, computing prediction metrics is time-consuming and error-prone. Thus, the feasibility analysis has a cost which needs some initial investment by the potential clients. This initial investment acts as a barrier to convincing potential clients of the benefits of institutionalizing a software prediction model. To reduce this barrier, in this paper we present the Pilot Defects Prediction Dataset Maker (PDPDM), a desktop application for measuring metrics to use for defect prediction. PDPDM receives as input the repository’s information of a software project, and it provides as output, in an easy and replicable way, a dataset containing a set of 17 well-defined product and process metrics that have been shown to be useful for defect prediction, such as size and smells. PDPDM avoids the use of outdated datasets and it allows researchers and practitioners to create defect datasets without the need to write any lines of code.

Is one hyperparameter optimizer enough?

Hyperparameter tuning is the black art of automatically finding a good combination of control parameters for a data miner. While widely applied in empirical Software Engineering, there has not been much discussion on which hyperparameter tuner is best for software analytics. To address this gap in the literature, this paper applied a range of hyperparameter optimizers (grid search, random search, differential evolution, and Bayesian optimization) to a defect prediction problem. Surprisingly, no hyperparameter optimizer was observed to be “best” and, for one of the two evaluation measures studied here (F-measure), hyperparameter optimization, in 50% of cases, was no better than using default configurations.

We conclude that hyperparameter optimization is more nuanced than previously believed. While such optimization can certainly lead to large improvements in the performance of classifiers used in software analytics, it remains to be seen which specific optimizers should be applied to a new dataset.
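Two of the optimizers named in this abstract, grid search and random search, can be contrasted with a default configuration in a minimal sketch. The `score` function is a toy stand-in for a validation measure such as F-measure. It, the parameter names `depth` and `rate`, the value ranges, and the default configuration are all assumptions made for illustration and do not come from the paper.

```python
import itertools
import random


def score(depth, rate):
    """Toy stand-in for a validation score; peaks at depth=6, rate=0.3."""
    return 1.0 - 0.01 * (depth - 6) ** 2 - (rate - 0.3) ** 2


def grid_search(depths, rates):
    """Evaluate every combination and return (best_score, best_config)."""
    return max((score(d, r), (d, r)) for d, r in itertools.product(depths, rates))


def random_search(n_trials, seed=0):
    """Sample n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        d = rng.randint(1, 10)
        r = rng.uniform(0.0, 1.0)
        trials.append((score(d, r), (d, r)))
    return max(trials)


default_score = score(5, 0.5)  # hypothetical default configuration
grid_best, grid_config = grid_search(range(1, 11), [0.1, 0.3, 0.5, 0.7, 0.9])
rand_best, rand_config = random_search(n_trials=50)
```

Comparing `grid_best` and `rand_best` against `default_score` mirrors the comparison against default configurations that the abstract reports. On a real dataset the outcome depends on the data and the evaluation measure.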

Differentially-private software analytics for mobile apps: opportunities and challenges

Software analytics libraries are widely used in mobile applications, which raises many questions about trade-offs between privacy, utility, and practicality. A promising approach to address these questions is differential privacy. This algorithmic framework has emerged in the last decade as the foundation for numerous algorithms with strong privacy guarantees, and has recently been adopted by several projects in industry and government. This paper discusses the benefits and challenges of employing differential privacy in software analytics used in mobile apps. We aim to outline an initial research agenda that serves as the starting point for further discussions in the software engineering research community.
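As one illustration of how differential privacy can apply to analytics collected from apps, the sketch below uses randomized response, a standard local differential privacy mechanism for a single yes/no event. The abstract does not name a specific mechanism, so this choice and all function names are assumptions.

```python
import math
import random


def truth_probability(epsilon):
    """Probability of reporting the true bit; requires epsilon > 0."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize(bit, epsilon, rng):
    """Report the true bit with probability p, the flipped bit otherwise."""
    p = truth_probability(epsilon)
    return bit if rng.random() < p else 1 - bit


def estimate_fraction(reports, epsilon):
    """Unbiased estimate of the true fraction of 1-bits from noisy reports."""
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    # E[observed] = f * p + (1 - f) * (1 - p); solve for the true fraction f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

Each device would call `randomize` locally before sending its report, and the server would call `estimate_fraction` on the collected reports. A smaller `epsilon` gives stronger privacy but a noisier estimate, which is the privacy and utility trade-off the abstract refers to.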

Towards a framework for generating program dependence graphs from source code

Originally conceived for compiler optimization, the program dependence graph has become a widely used internal representation for tools in many software engineering tasks. The currently available frameworks for building program dependence graphs rely on compiled source code, which requires resolving dependencies. As a result, these frameworks cannot be applied for analyzing legacy codebases whose dependencies cannot be automatically resolved, or for large codebases in which resolving dependencies can be infeasible. In this paper, we present a framework for generating program dependence graphs from source code based on transition rules, and we describe lessons learned when implementing two different versions of the framework based on a grammar interpreter and an abstract syntax tree iterator, respectively.
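To make the notion of a program dependence graph more concrete, the following sketch derives data dependence edges for straight-line code, given the variable each statement defines and the variables it uses. It is not the paper's transition-rule framework. It ignores control dependence, branches, and loops, and the input format and function name are hypothetical.

```python
def data_dependence_edges(statements):
    """Return (source, target) index pairs for def-use data dependences.

    `statements` is a list of (defined_variable, used_variables) tuples in
    program order; only straight-line code is handled, so the most recent
    definition of a variable is the only one that reaches a later use.
    """
    last_def = {}  # variable name -> index of its most recent definition
    edges = []
    for index, (defined, used) in enumerate(statements):
        for var in used:
            if var in last_def:
                edges.append((last_def[var], index))
        if defined is not None:
            last_def[defined] = index
    return edges
```

For the statements `a = 1; b = a; a = a + b; print(a)`, encoded as `[("a", []), ("b", ["a"]), ("a", ["a", "b"]), (None, ["a"])]`, the function yields one edge per use of an earlier definition. A full program dependence graph would add control dependence edges as well.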