Description

Software reliability has become increasingly important. Unfortunately, software errors remain frequent and are a major cause of system failures. Further, detecting and fixing bugs is one of the most time-consuming and difficult tasks in software development. To facilitate this process, it would be highly beneficial to first analyze and understand bug characteristics, and then detect bugs automatically. However, the huge amount of data in large software, such as source code and documents, makes this analysis tedious and difficult for developers.
This dissertation proposes a novel approach that applies data mining techniques to extract information from large software and exploits the extracted information for bug detection. Because data mining techniques can handle huge amounts of data efficiently, this approach can efficiently discover useful information in large bodies of source code and documents.
To understand bug characteristics, this dissertation studies the bug databases of two large and popular open source projects. The databases contain more than 300,000 bug reports, far too many to study manually. To extract useful information from this data, the dissertation proposes applying text classification and information retrieval techniques to automatically classify the bugs along different dimensions, namely root causes, impacts, and software components. The study shows that this approach can help developers analyze and understand bug characteristics in large software efficiently, and can facilitate testing and bug detection so as to improve reliability. Furthermore, the study has produced several new findings about bug characteristics that can provide useful guidelines for related research.
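As a rough illustration of classifying bug reports by root cause with text classification, the sketch below trains a small multinomial Naive Bayes model on a handful of reports. The abstract does not say which classifier, features, or category names the study uses, so the algorithm choice, the `NaiveBayesBugClassifier` class, the three labels, and the training sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesBugClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, reports, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(reports, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-written reports and labels; not data from the dissertation.
TRAIN = [
    ("null pointer dereference causes crash on startup", "memory"),
    ("buffer overflow when copying long string", "memory"),
    ("deadlock between two threads holding locks", "concurrency"),
    ("race condition on shared counter without lock", "concurrency"),
    ("wrong result returned when input list is empty", "semantic"),
    ("function ignores the timeout option and returns wrong value", "semantic"),
]

clf = NaiveBayesBugClassifier().fit(
    [text for text, _ in TRAIN], [label for _, label in TRAIN]
)
print(clf.predict("crash caused by buffer overflow"))
```

The same structure could be trained separately for each dimension (root cause, impact, component), with a labeled sample of reports standing in for the toy data.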
One finding of the bug characteristic study is that semantic errors are the major root cause of bugs in modern software. Semantic bugs are application specific, so detecting them requires knowledge about the application. To address this problem, this dissertation proposes using data mining techniques to automatically detect software bugs. To demonstrate this approach, it presents two automatic bug detection tools: PR-Miner, which extracts programming rules and detects violations, and CP-Miner, which detects copy-pasted code and related bugs.
PR-Miner is one of the data-mining-based bug detection tools that this dissertation proposes for large software. Programs usually follow many implicit programming rules, most of which are too tedious for programmers to document manually. Violating these rules can easily introduce bugs. Therefore, it is highly desirable to extract such rules automatically and to detect violations automatically as well. Previous work in this direction focuses on simple function-pair-based programming rules and additionally requires programmers to provide rule templates. PR-Miner uses frequent itemset mining to efficiently extract implicit programming rules from large software code, requiring little effort from programmers and no prior knowledge of the software. Benefiting from data mining, PR-Miner can extract programming rules in general forms that can contain multiple program elements of various types. In addition, this dissertation proposes an efficient algorithm to automatically detect violations of the extracted programming rules, which are strong indications of bugs. The evaluation on large software code shows that PR-Miner can efficiently extract thousands of general programming rules and detect violations within minutes. Moreover, PR-Miner has detected many violations of the extracted rules, which are potential bugs.
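The idea of mining rules from frequent itemsets and flagging violations can be sketched as follows. This is a minimal illustration, not PR-Miner's algorithm: it enumerates itemsets by brute force rather than with an efficient miner, it treats each function as a plain set of called function names, and the thresholds, the helper names `frequent_itemsets` and `find_violations`, and the `FUNCTIONS` data are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force search for itemsets found in >= min_support transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t)
            if support >= min_support:
                result[cset] = support
    return result


def find_violations(functions, min_support=3, min_confidence=0.75):
    """Derive rules lhs => rhs from frequent itemsets, then report
    functions that contain lhs but lack rhs."""
    transactions = [frozenset(elems) for elems in functions.values()]
    frequent = frequent_itemsets(transactions, min_support)
    violations = set()
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for rhs in itemset:
            lhs = itemset - {rhs}
            # Every subset of a frequent itemset is frequent, so lhs is present.
            confidence = support / frequent[lhs]
            # A rule that usually, but not always, holds suggests a violation.
            if min_confidence <= confidence < 1.0:
                for name, elems in functions.items():
                    if lhs <= set(elems) and rhs not in elems:
                        violations.add((name, tuple(sorted(lhs)), rhs))
    return sorted(violations)


# Hypothetical program: each function mapped to the functions it calls.
FUNCTIONS = {
    "read_config": {"lock", "unlock", "open", "close"},
    "write_log": {"lock", "unlock", "open", "close"},
    "update_cache": {"lock", "unlock"},
    "flush_queue": {"lock", "unlock"},
    "reset_state": {"lock"},
}

for name, lhs, rhs in find_violations(FUNCTIONS):
    print(f"{name}: has {lhs} but is missing {rhs!r}")
```

In this toy data, `lock` appears with `unlock` in four of five functions, so the rule "lock implies unlock" holds with confidence 0.8 and `reset_state` is reported as a potential bug.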
To further demonstrate the approach, this dissertation proposes a second tool that identifies copy-pasted code and detects related bugs. Copy-pasted code is very common in large software because programmers often reuse code via copy-paste to reduce programming effort. However, copy-pasting is prone to introducing bugs, and efficiently identifying copy-pasted code in large software is challenging. Existing copy-paste detection tools either do not scale to large software or cannot handle small modifications in copy-pasted code. Furthermore, few tools are available to detect copy-paste related bugs. To address these problems, this dissertation proposes CP-Miner, which uses frequent sequence mining to efficiently identify copy-pasted code in large software and detects copy-paste related bugs. To further understand copy-paste in system software, this dissertation also analyzes some interesting characteristics of copy-paste in Linux and FreeBSD.
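To give a feel for how copy-pasted code can be found despite small modifications such as renamed variables, the sketch below normalizes each statement and then looks for repeated runs of identical normalized statements. It is a simplified stand-in: it uses exact matching of fixed-length windows rather than the frequent sequence mining that CP-Miner uses, so it does not tolerate inserted or deleted statements. The keyword list, the `normalize` and `find_copy_paste` helpers, and the sample `CODE` are assumptions made for this example.

```python
import re
from collections import defaultdict

# Small, hypothetical subset of C keywords that should not be abstracted away.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def normalize(statement):
    """Map identifiers to ID and integer literals to NUM, keeping keywords."""
    def replace_token(match):
        token = match.group(0)
        if token in KEYWORDS:
            return token
        return "NUM" if token[0].isdigit() else "ID"

    abstracted = re.sub(r"[A-Za-z_]\w*|\d+", replace_token, statement)
    return re.sub(r"\s+", "", abstracted)


def find_copy_paste(statements, min_len=3):
    """Return (i, j) pairs where the min_len statements starting at
    indexes i and j are identical after normalization."""
    norm = [normalize(s) for s in statements]
    windows = defaultdict(list)
    for start in range(len(norm) - min_len + 1):
        windows[tuple(norm[start:start + min_len])].append(start)
    pairs = []
    for starts in windows.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                if starts[b] - starts[a] >= min_len:  # skip overlapping windows
                    pairs.append((starts[a], starts[b]))
    return sorted(pairs)


# Hypothetical C-like statements; the second loop is a renamed copy of the first.
CODE = [
    "for (i = 0; i < n; i++)",
    "total += prices[i];",
    "avg = total / n;",
    "printf(msg);",
    "for (j = 0; j < m; j++)",
    "sum += weights[j];",
    "mean = sum / m;",
]

print(find_copy_paste(CODE))
```

Because identifiers and literals are abstracted before comparison, the two loops match even though every variable name differs.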

You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).