Abstract
Information Retrieval (IR) methods, and in particular topic models,
have recently been used to support essential software engineering (SE) tasks
by enabling textual retrieval and analysis of software artifacts.
In all these approaches, topic models have been applied to software artifacts
in the same way as to natural language documents (e.g., using the same settings and parameters),
because the underlying assumption was that source code and natural language documents are similar.
However, applying topic models to software data with the same settings as for natural language text
did not always produce the expected results.
Recent research has investigated this assumption and shown that source code is
much more repetitive and predictable than natural language text.
Our paper builds on this fundamental finding and proposes LDA-GA, a novel solution
that uses Genetic Algorithms (GA) to determine a near-optimal configuration of a topic modeling technique,
namely Latent Dirichlet Allocation (LDA), so that LDA can be adapted, configured, and effectively used
to achieve better performance in the context of three different SE tasks: (1) traceability link recovery,
(2) feature location, and (3) software artifact labeling.
The results of our empirical studies demonstrate that LDA-GA is able
to identify robust LDA configurations, which lead to higher accuracy on all the datasets for these SE tasks
than previously published results, heuristics, and the results
of a combinatorial search.
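
To make the idea concrete, below is a minimal sketch of how a genetic algorithm can search the space of LDA configurations (number of topics, Dirichlet priors, number of iterations). The toy corpus, parameter ranges, GA operators, and the perplexity-based fitness function are illustrative assumptions, not the configuration space, fitness measure, or datasets used in the paper; the sketch uses gensim's LdaModel as the underlying topic model.

# Illustrative sketch of the LDA-GA idea: a genetic algorithm searching over
# LDA hyperparameters. The toy corpus, parameter ranges, GA settings, and the
# perplexity-based fitness are assumptions for illustration only.
import random
from gensim.corpora import Dictionary
from gensim.models import LdaModel

random.seed(42)

# Toy "software artifacts", already tokenized (hypothetical data).
texts = [
    ["parse", "token", "lexer", "grammar"],
    ["socket", "connect", "send", "receive"],
    ["parse", "grammar", "ast", "node"],
    ["thread", "lock", "mutex", "wait"],
    ["socket", "bind", "listen", "accept"],
    ["lock", "wait", "notify", "thread"],
] * 5  # repeat to give LDA a bit more signal

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

def random_config():
    # Chromosome: (number of topics, alpha, eta, number of iterations).
    return (random.randint(2, 10),
            random.uniform(0.01, 1.0),
            random.uniform(0.01, 1.0),
            random.choice([50, 100, 200]))

def fitness(cfg):
    k, alpha, eta, iters = cfg
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary,
                   alpha=alpha, eta=eta, iterations=iters,
                   passes=2, random_state=0)
    # Stand-in fitness: per-word likelihood bound (higher is better).
    return lda.log_perplexity(corpus)

def crossover(a, b):
    # Single-point crossover between two parent configurations.
    cut = random.randint(1, 3)
    return a[:cut] + b[cut:]

def mutate(cfg, rate=0.2):
    # Replace each gene with a freshly sampled value with probability `rate`.
    return tuple(g if random.random() > rate else n
                 for g, n in zip(cfg, random_config()))

def lda_ga(pop_size=8, generations=5):
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]            # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

if __name__ == "__main__":
    best = lda_ga()
    print("near-optimal configuration (k, alpha, eta, iterations):", best)

In practice, the perplexity-based fitness would be replaced by whatever quality measure the approach actually optimizes, and each candidate configuration would be evaluated on the SE task at hand (e.g., traceability link recovery or feature location).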