Empirical Software Engineering
- Clone Categorization
- Clone Evaluation
- Clone Evolution
- Software Engineering for Computational Science and Engineering
- Text Retrieval based Feature Location
- Trust and Expertise in Open Source Software Development
Clone Categorization

Because 50% to 90% of developer effort during software maintenance is spent on program comprehension activities, techniques and tools that reduce this effort are essential to reducing maintenance costs. One characteristic of a software system that can adversely affect its comprehensibility is the presence of similar or identical segments of code, or code clones. To promote developer awareness of the existence of code clones in a system, researchers have recently directed much attention to the problem of detecting these clones; they have developed techniques and tools for clone detection and have discovered that significant portions of popular software systems, such as the Linux kernel, consist of cloned code. However, knowledge of the existence of clones is not sufficient to allow a developer to perform maintenance tasks correctly and completely in the presence of clones. Proper performance of these tasks requires a deep and complete understanding of the relationships among the clones in a system; thus, new techniques and tools that assist developers in the analysis of large numbers of clones are a critical need.
The goal of this project is to develop an automated and rigorous analysis process for categorizing code clones using their structural and semantic properties. Existing categorization techniques and tools, other than those developed by Tairas and Gray, consider only lexical or syntactic properties of clones. Tairas and Gray consider semantic properties of clones, but report that understanding the structural similarities and differences among clones in a software system is vital to achieving a deep and complete understanding of the relationships among those clones. The research team will develop techniques and tools that identify and codify these relationships using structural and semantic properties of clones. The expected research outcomes are: (1) a suite of metrics for measuring the congruence and complementarity of a number of graphical static program representations that capture structural properties, and a process to categorize code clones based on these metrics; (2) serial and integrated processes that combine structural categorization of code clones and improved semantic categorization of code clones; and (3) a domain analysis for the code clone categorization domain and empirical validation of the techniques and tools from the previous two outcomes.
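One simple way to read the congruence and complementarity metrics in outcome (1) is as set overlaps over the edge sets of two graphical program representations. The sketch below is illustrative only: the Jaccard-style measures and the toy control-flow and data-dependence graphs are invented here, not the project's actual metric suite.

```python
# Illustrative (invented) congruence/complementarity metrics over two
# graph-based program representations, each modeled as a set of directed
# edges between program elements.

def congruence(edges_a, edges_b):
    """Jaccard similarity of two edge sets (1.0 = identical graphs)."""
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)

def complementarity(edges_a, edges_b):
    """Fraction of edges unique to one representation (1.0 = disjoint)."""
    if not edges_a and not edges_b:
        return 0.0
    return len(edges_a ^ edges_b) / len(edges_a | edges_b)

# Toy example: a control-flow graph and a data-dependence graph
# over the same four statements.
cfg = {("s1", "s2"), ("s2", "s3"), ("s3", "s4")}
ddg = {("s1", "s3"), ("s2", "s3"), ("s3", "s4")}

print(congruence(cfg, ddg))       # 0.5
print(complementarity(cfg, ddg))  # 0.5
```

High congruence between two representations suggests they capture redundant structure, whereas high complementarity suggests that categorizing clones with both representations together could be more informative than either alone.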
Clone Evaluation

Code clone analysis techniques and tools are popular topics in the software engineering research community. Many studies draw conclusions based solely on analytical evaluation. These claims focus primarily on tool performance in terms of portability, scalability, robustness, precision, and recall. However, such analytical studies cannot adequately evaluate how developers behave while using the tools. Human-based empirical studies complement studies based on analytical data because they provide direct insight into developer behavior.
To truly understand human behavior, and to validate claims about it, researchers should add human-based empirical studies to their validation toolbox. For example, to truly understand and improve the usefulness of tool output for completing a development task, humans must be observed while using the tool. Another area in which human studies could provide insight is the question of whether clones are helpful or harmful. While early studies indicated that clones were harmful from a maintenance perspective, more recent studies have suggested that clones may actually improve productivity. An in-depth study of the actual effects of clones on developer behavior could provide important insight into this question.
Human-based empirical studies focused on understanding how developers create, edit, and maintain clones can open new opportunities to understand the primary needs of clone analysis and ultimately help the end user, the software developer. The goal of this project is to design and conduct human-based empirical studies to complement the wealth of analytical studies available in the literature.
Clone Evolution

Code clones are source code fragments that are similar or identical in terms of text, vocabulary, structure, or meaning. Fowler et al. classified code duplication (cloning) as a bad smell and thus as a significant indicator of poor software maintainability. However, more recent work indicates that clones are not as harmful as previously believed and may actually improve productivity. Indeed, Rahman et al. find little empirical evidence that clones negatively affect software maintainability, but do find that cloned code may be less fault prone than non-cloned code. Nevertheless, there are long-term risks associated with cloning, such as the potential duplication of defects and the possible loss of implicit or explicit links among code fragments that must remain consistent. Thus, detection of code clones is of concern to both researchers and practitioners.
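The similarity dimensions above (text, vocabulary, structure) correspond to the commonly used clone-type taxonomy: fragments identical up to whitespace are exact (Type-1) clones, while fragments that differ only in identifier names are parameterized (Type-2) clones. The sketch below illustrates this distinction with a deliberately naive tokenizer and invented code snippets; real detectors use proper lexers and suffix- or tree-based matching.

```python
import re

# Illustrative sketch: distinguishing exact (Type-1) clones from
# identifier-renamed (Type-2) clones by comparing token streams.
# The tokenizer and keyword list are deliberately naive.

def tokens(code):
    # Split into identifiers, integer literals, and single punctuation chars.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def normalize(toks, keywords=("int", "return", "for", "if")):
    # Replace every identifier with a placeholder; keep keywords,
    # literals, and punctuation.
    return ["ID" if re.match(r"[A-Za-z_]\w*$", t) and t not in keywords else t
            for t in toks]

a = "int total = x + y;"
b = "int   total = x + y;"   # only whitespace differs
c = "int sum = a + b;"       # identifiers renamed

print(tokens(a) == tokens(b))                        # True  -> Type-1 clone
print(tokens(a) == tokens(c))                        # False
print(normalize(tokens(a)) == normalize(tokens(c)))  # True  -> Type-2 clone
```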
An analysis of the clone detection results for a single source code version provides a developer with information about a discrete state in the evolution of the software system. However, tracing clones across multiple source code versions permits a clone analysis to consider a temporal dimension. Such an analysis of clone evolution can be used to uncover the patterns and characteristics exhibited by clones as they evolve within a system. Developers can use the results of this analysis to understand the clones more completely, which may help them to manage the clones more effectively. Thus, studies of clone evolution serve a key role in understanding and addressing issues of cloning in software.
One focus of Dr. Kraft’s ongoing work in this area is clone evolution patterns. Due to a lack of discernible system-independent clone evolution patterns, some researchers believe that clone evolution is system specific. However, current clone evolution patterns are derived solely from the consistency of the changes to the clones from one source code revision to the next. Moreover, if clone evolution does follow a system-specific pattern, there may be source code metrics that can be used to predict these patterns.
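The change-consistency notion that underlies current clone evolution patterns can be sketched very simply: between two revisions, a clone pair either changes in both fragments, in neither, or in only one. The classifier and the example fragments below are invented for illustration; a real genealogy analysis would first map fragments across revisions and normalize formatting.

```python
# Illustrative sketch: classifying the evolution of a clone pair between two
# revisions as consistent (both fragments changed, or neither did) or
# inconsistent (only one fragment changed).

def changed(old_fragment, new_fragment):
    # A real analysis would normalize whitespace and identifiers first;
    # this sketch compares raw text.
    return old_fragment != new_fragment

def classify_pair(old_a, new_a, old_b, new_b):
    ca, cb = changed(old_a, new_a), changed(old_b, new_b)
    if not ca and not cb:
        return "unchanged"
    return "consistent" if ca and cb else "inconsistent"

rev1_a = "total = price * qty"
rev1_b = "total = price * qty"
rev2_a = "total = price * qty * (1 + tax)"
rev2_b = "total = price * qty"  # change applied to only one copy

print(classify_pair(rev1_a, rev2_a, rev1_b, rev2_b))  # inconsistent
```

Inconsistent changes like the one above are exactly the long-term risk noted earlier: a fix or enhancement applied to one fragment but silently missed in its clone.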
Software Engineering for Computational Science and Engineering
Software for computational science and engineering (CS&E) supports myriad scientific and engineering endeavors, including cancer research, analysis of large data sets (e.g., satellite data), development of new products or materials (e.g., new vehicles or HIV vaccines), and simulation of natural phenomena (e.g., particle systems or climate change). Due to society’s increasing reliance on CS&E software, understanding its development is critical. Indeed, using that understanding to discover and to support best practices for CS&E software is vital to our Nation’s future.
Dr. Carver’s previous work in this area has identified a number of important characteristics of CS&E software development that must be considered when applying SE practices to the CS&E domain. Those characteristics include: (1) because CS&E projects are often exploring new science, the requirements gathering and discovery process is difficult, (2) the main driver of the projects is the investigation of new science, not the use of appropriate SE practices, and (3) developers are wary of heavyweight processes and lean towards more agile approaches.
Text Retrieval based Feature Location
Feature location is a program comprehension activity in which a developer locates the source code entities that implement a functionality (i.e., a feature). Due to the large size of modern software systems, manual feature location is impractical. Thus, researchers have devoted much effort to developing (partially) automated feature location techniques (FLTs), many of which are based on text retrieval. Indeed, Dit et al. recently reviewed 87 articles from 26 venues and found that 27 of the 52 FLTs are based (at least in part) on text retrieval. However, these text retrieval based techniques are highly configurable. For example, when using latent semantic indexing (LSI) we must select k, the number of (reduced) dimensions, or when using latent Dirichlet allocation (LDA) we must select α, β, and K, the two smoothing hyperparameters and the number of topics, respectively.
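Before any LSI or LDA modeling, the text-retrieval core of an FLT can be sketched as ranking methods by the similarity of their identifier/comment vocabulary to a feature query. The sketch below uses plain TF-IDF with cosine similarity; the corpus, method names, and query are invented for illustration and stand in for a real indexed code base.

```python
import math
from collections import Counter

# Illustrative text-retrieval FLT: each method is a bag of words drawn from
# its identifiers and comments; methods are ranked by TF-IDF cosine
# similarity to a feature query. Corpus and query are invented.

corpus = {
    "Cart.addItem": "add item cart price quantity",
    "Cart.total":   "compute total price tax cart",
    "User.login":   "check user password login session",
}
query = "compute cart total price"

docs = {name: text.split() for name, text in corpus.items()}
n = len(docs)
df = Counter(w for ws in docs.values() for w in set(ws))  # document frequency

def tfidf(words):
    tf = Counter(words)
    return {w: tf[w] * math.log(n / df[w]) for w in tf if w in df}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = tfidf(query.split())
ranking = sorted(docs, key=lambda m: cosine(q, tfidf(docs[m])), reverse=True)
print(ranking[0])  # Cart.total -- the method ranked most relevant
```

LSI and LDA replace these raw word vectors with reduced or topic-based representations, which is precisely where the configuration choices (k, or α, β, and K) enter.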
Unfortunately, few studies of text retrieval based FLTs directly address the decisions that a practitioner or researcher must make when configuring an FLT. Indeed, the feature location literature contains no empirical evidence that supports the selection of one configuration over another. Closely related literature provides some empirical evidence, but it is mixed. For example, Marcus and Poshyvanyk report no performance change when changing configurations in their study of conceptual cohesion of classes, whereas Abadi et al. report a performance decrease when changing configurations in their study of traceability.
One focus of Dr. Kraft’s ongoing work in this area is the configuration of LDA-based feature location. LDA is parameterized by the inference algorithm used for approximation (I), α, β, K, and the similarity measure (S). During the Summer 2011 REU program at UA, students studied the effects of these parameters on the accuracy of LDA-based feature location for five medium-sized (10 KLOC to 250 KLOC) subject systems. We submitted two manuscripts (one conference paper and one journal paper) based on this work. The studies offer new insight into configuring LDA-based feature location but also leave many questions unanswered.
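A study of these parameters must cover a combinatorial configuration space: each LDA-based FLT configuration fixes a choice of I, α, β, K, and S. The candidate values below are invented for illustration (they are not the values used in the REU studies); the sketch only shows how quickly the space grows.

```python
from itertools import product

# Illustrative enumeration of an LDA-based FLT configuration space.
# Candidate values are invented; I = inference algorithm, alpha and beta =
# smoothing hyperparameters, K = number of topics, S = similarity measure.

I_values = ["gibbs", "variational"]
alpha_values = [0.1, 0.5, 1.0]
beta_values = [0.01, 0.1]
K_values = [50, 100, 200, 400]
S_values = ["cosine", "hellinger"]

configs = list(product(I_values, alpha_values, beta_values, K_values, S_values))
print(len(configs))  # 2 * 3 * 2 * 4 * 2 = 96 configurations to evaluate
```

Even this modest grid yields 96 configurations per subject system, which helps explain why the literature offers so little empirical guidance on configuration choices.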
Trust and Expertise in Open Source Software Development
The Internet has enabled distributed teams whose members have never met in person to collaborate on software development projects using tools such as email, distributed version control, wikis, instant messengers, and chat rooms. Despite these enabling technologies, the literature suggests that distributed projects do not perform as well as collocated projects; for example, distributed projects require more time and resources. Two factors, inaccurate perceptions of expertise and low trust, interact to affect the success of a distributed project. These factors are important because when a developer’s view of a colleague’s expertise is low or incorrect, project productivity will likely decrease.
Dr. Carver’s previous work in this area has focused on understanding how developers evaluate the programming ability of their peers in a classroom setting. Based on these findings, Dr. Carver has recently completed an online survey of more than 100 open-source developers. The goal of this survey was to gather specific information about how open-source developers communicate with each other and how they form opinions of the expertise of their peers.