Abstract
The market for mobile apps is growing every day and is expected to be worth over 100 billion dollars in 2020. To have a chance to succeed in such a competitive environment, developers need to build and maintain high-quality apps, rapidly fixing any bug and continuously astonishing their users with the coolest new features. Peculiar to the mobile apps market is the user review mechanism: users can assign a rating to the apps they download and express their feelings while using them. Such reviews are mainly designed as a feedback mechanism among users, aimed at recommending to each other which apps are worth downloading. However, they also contain precious information for software developers, reporting bugs (e.g., unexpected behaviours) and recommending new features to implement. To exploit such a source of information, developers are supposed to manually read the user reviews, something simply not doable when hundreds of reviews per day are collected (as is usual for popular apps). To help developers deal with such a task, we developed CLAP (Crowd Listener for releAse Planning), a web application able to (i) categorize user reviews based on the information they convey (e.g., reports of security issues), (ii) cluster together related reviews (e.g., all those reporting the same bug), and (iii) prioritize the clusters of reviews to be implemented when planning the subsequent app release. We evaluated all the steps behind CLAP, showing its high accuracy in categorizing and clustering reviews and the meaningfulness of the recommended prioritizations. Also, given the availability of CLAP as a working tool, we assessed its practical applicability in industrial environments.
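To give an idea of the categorize, cluster, and prioritize flow described above, here is a minimal, hypothetical sketch. The keyword rule, the one-cluster-per-category grouping, and the size-based ordering are placeholders chosen purely to illustrate the data flow; they are not CLAP's actual classification, clustering, or prioritization techniques.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the three-step flow (categorize -> cluster -> prioritize).
// The heuristics below are illustrative only, not CLAP's actual approach.
public class ClapFlowSketch {

    public static void main(String[] args) {
        List<String> reviews = List.of(
                "The app crashes when I open the camera",
                "Please add a dark mode",
                "Crash on startup after the last update");

        // Step 1: assign each review to a category (naive keyword rule).
        Map<String, List<String>> byCategory = new HashMap<>();
        for (String review : reviews) {
            String category = review.toLowerCase().contains("crash")
                    ? "functional bug report" : "suggestion for new feature";
            byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(review);
        }

        // Step 2: cluster related reviews (here, trivially one cluster per category).
        // Step 3: prioritize clusters, e.g., larger clusters first.
        byCategory.entrySet().stream()
                .sorted((a, b) -> b.getValue().size() - a.getValue().size())
                .forEach(e -> System.out.println(
                        e.getKey() + " (" + e.getValue().size() + " reviews): " + e.getValue()));
    }
}
```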
Errata
There is an error in the results concerning RQ3 (prioritization accuracy). Because of a bug in the code we used to compute the results, some clusters were ignored (11/211, i.e., 5.5%). We report here the corrected versions of Table 3 (total number of clusters and high/low-priority clusters for each category used to answer RQ3) and Table 7 (RQ3: prioritization accuracy). We would like to thank Mashael Etaiwi for helping us spot this issue.
Table 3. Total number of clusters and high/low-priority clusters for each category used to answer RQ3.

| Category | Total | Low priority | High priority |
| --- | ---: | ---: | ---: |
| Functional bug report | 114 | 13 | 101 |
| Sugg. new feature | 78 | 11 | 67 |
| Report of security issues | 4 | 0 | 4 |
| Report of performance problems | 6 | 1 | 5 |
| Report of excessive energy consumption | 2 | 0 | 2 |
| Report for usability improvements | 7 | 2 | 5 |
| Total | 211 | 27 | 184 |
Table 7. RQ3: Prioritization accuracy.

| Category | Accuracy | False positives | False negatives |
| --- | ---: | ---: | ---: |
| Functional bug report | 91% | 5% | 4% |
| Sugg. new feature | 81% | 10% | 9% |
| Non-functional | 79% | 10% | 10% |
Raw Data
- RQ1: the 3,000 user reviews (raw CSV file).
- RQ1: the 3,000 user reviews, already preprocessed into the ARFF format, which can be loaded into the WEKA tool (use RandomForest to replicate the CLAP results; see the sketch after this list).
- RQ1: accuracy achieved by using different machine learners (all implemented in WEKA).
- RQ2: the 160 user reviews classified as bug reports/suggestions for new features, used to evaluate the CLAP clustering feature.
- RQ2: the clustering oracles manually built by the three industrial developers (includes a README file explaining how to interpret them).
- RQ3: the CSV file containing all the reviews from all the apps considered in RQ3, with information about category, cluster, and priority (corrected version).
- RQ3: the prioritization recommended by CLAP. The ARFF file can be loaded into the WEKA tool (use RandomForest to replicate the CLAP results) (corrected version).
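For reference, the following is a minimal sketch of how one of the ARFF files above can be fed to WEKA's Random Forest from the Java API. The file name `reviews-rq1.arff`, the position of the class attribute, and the 10-fold cross-validation setup are assumptions made for illustration; adapt them to the actual files in the package and to the evaluation protocol described in the paper.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class ReplicateClap {
    public static void main(String[] args) throws Exception {
        // Load the preprocessed reviews; "reviews-rq1.arff" is a placeholder
        // name for the ARFF file distributed in this package.
        Instances data = new DataSource("reviews-rq1.arff").getDataSet();
        // Assumption: the class attribute (review category) is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // Random Forest, the learner indicated above for replicating CLAP's results.
        RandomForest rf = new RandomForest();

        // 10-fold cross-validation; the exact validation setup may differ
        // from the one used in the paper.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
    }
}
```

The same steps apply to the RQ3 ARFF file (prioritization): load it, set the class attribute, and evaluate RandomForest on it.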