Reliability of Usability Evaluations

Usability Testing and Assessment, May 2017

Introduction

When considering the reliability of usability evaluations, it is important first to define what it means to be reliable. Reliability is the degree to which an assessment produces consistent and stable results. However, the most useful tests must be not only reliable but also valid, measuring what they are intended to measure. In an ideal world, true testing reliability would mean that test results are replicable across different organizations, users, and periods of time; yet extensive research on testing methodologies has shown that this is not the case. Over the past 25 years, the reliability of usability testing has been investigated and challenged with regard to several factors: the number and composition of users, tasks, methodology, and evaluation biases.

This review is an investigation of literature on the reliability and validity of usability evaluations. Although some reliability problems are intrinsically unavoidable, the research presented will be supplemented by clear recommendations for future practice. These recommendations are intended to resolve inconsistencies within current usability evaluation practice and provide a solid, valid foundation for developers to take action.

Users

The quality and consistency of test results depend not only on the tests themselves but also on the users taking them. Since the 1990s, a wealth of usability testing research has focused on the optimal, “magic” number of test users. Testing is expensive and time-consuming, so to maximize return on investment, organizations should test enough users to uncover a high percentage of usability problems without testing so many that returns diminish. Virzi’s (1990) seminal research on the optimal number of subjects found that 5 individuals were sufficient to find approximately 80 percent of the problems. Notably, the first few users tested found the most severe problems.
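The diminishing-returns argument is often illustrated with a simple cumulative discovery model, in which each user independently reveals any given problem with probability p, so that n users are expected to reveal 1 - (1 - p)^n of the problems. The Python sketch below assumes that model. The function names and the detection rate of 0.30 are illustrative assumptions, not values taken from the studies cited here, and real detection rates vary with the product and the tasks.

```python
def proportion_found(n_users: int, p_detect: float) -> float:
    """Expected share of problems found by n_users, assuming each user
    independently detects any given problem with probability p_detect."""
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    return 1.0 - (1.0 - p_detect) ** n_users


def users_needed(target: float, p_detect: float) -> int:
    """Smallest number of users whose expected coverage reaches target."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    if not 0.0 < p_detect <= 1.0:
        raise ValueError("p_detect must be in (0, 1]")
    n = 0
    while proportion_found(n, p_detect) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Assumed detection rate of 0.30 per user (illustrative only).
    for n in (1, 3, 5, 10, 15):
        print(n, round(proportion_found(n, 0.30), 2))
```

Under these assumptions, five users are expected to reach roughly 83 percent coverage. Lowering the assumed detection rate to 0.10, as might be the case for a very large or complex interface, raises the number of users needed for 80 percent coverage to 16.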

However, for highly complex interfaces, such as e-commerce websites, the extraordinary amount of content a test user may sift through in a given session means that only a fraction of the most severe usability problems may be identified (Spool & Schroeder, 2001). For such websites, five users may be insufficient to find the most critical usability problems. A sample of five is also inadequate when lives are at stake. For medical devices, a larger group of approximately 25 individuals is required to validate the product’s safety in line with medical regulations (Francik, 2015).

Regardless of the number of participants, results may be meaningless if the users chosen are not representative of the target audience. Every user should be chosen with care. Product experts may be successful at identifying usability problems because of their understanding of and fluency with a particular product and its functions. Novices, however, may provide fresh eyes and unbiased impressions. Thus, it is important to involve the development and management teams in recruiting users who represent the target audience and whose feedback will be of high value. The overall goal in recruitment is to find users who can identify usability problems effectively; to achieve this goal, 5 quality users are a good start, with additional participants and resources carefully allocated to adjust for greater complexity and higher risk.

Differences Across Organizations

Research suggests that testing does not yield consistent results across different organizations. Molich’s (2006) comparative usability evaluation (CUE-4) of four usability labs found that only a minority of problems were found by all teams and that the vast majority of problems were unique, found by a single team. Further studies that removed possible confounding factors (limiting users to professionals, limiting the number of issues presented, and consolidating similar problems) still found limited overlap in usability problems across organizations (Kessner et al., 2001). These studies reveal that different groups do not consistently identify the same usability problems, and they point to several possible sources of this variability: differences in individual test administrators, tasks, and methodology.

Confounding Factors and Biases

Moderators and Observers

Within the same organization, usability results may vary based on how individual moderators and observers act during their sessions. Comparing the observations of the usability experts present in a session, and noting where they converge or diverge, may shed light on the influence of individual styles or biases on the test results. Individuals with greater skill, knowledge, experience, and social fluency may be highly effective and elicit more salient results from participants; thus a team should strive to follow the best practices of its strongest moderators and observers. Furthermore, team members are advised to practice the observer, moderator, and tester roles with one another and to run through scripts multiple times in advance of a session to improve preparation and performance.

Task Content

To achieve consistency, tasks should be carefully constructed and evaluated for content as well as depth. Task creation requires close coordination between the usability practitioners and the management/development team. The management team is responsible for relaying specific goals, and the usability practitioners are responsible for translating those goals into accurate task content that elicits useful results. As Bloomer, Landesman & Wolfe (2007) describe, understanding business goals is particularly important in task development. A practitioner interested in a website’s attrition rate, for example, should focus on developing tasks that may cause frustration and hit on “pain points.” Without clear and coherent communication between management and practitioners, the reliability of the testing to achieve major goals is reduced. Clear channels of communication, both positive (affirming test plans) and negative (restructuring test plans), are therefore necessary for creating content of value.

Task Coverage

Great debate exists over the optimal breadth of usability tests. Lindgaard and Chattratichart’s (2007) analysis of CUE-4 found significant correlations between the number of user tasks and the percentage of problems found, but no significant correlations between the number of users and the percentage of problems found. The study concluded that emphasis should be placed on giving more tasks to a smaller group of users rather than giving fewer tasks to a larger group. However, in a realistic setting, not every possible task can be given to a user. Usability experts must instead focus on defining the most important goals and interface features and use those in the development of deep, thoughtful tasks (Molich & Dumas, 2006). Doing so will allow deeper investigation and iteration on areas of interest and will also reduce the time spent trying to find every problem, including problems that are minor or less relevant to the overall goals of the usability test. This focus on productivity over quantity will result in a greater return on investment of time and resources.

Methodology

Methodology may also exert strong effects on usability testing results. A limitation of Molich’s CUE-4, and perhaps a great contributor to the inconsistency in results, was that there were no constraints on the evaluation method. For greater reliability, test methods should not only follow universal testing best practices but also be relatively consistent from practitioner to practitioner within a team, as different methodologies lead users to contribute different sets of usability problems (Molich & Dumas, 2006). Wilson (2006) suggests that embracing a greater breadth of methodologies, such as surveys, interviews, and questionnaires in addition to formal testing, may strengthen the validity of the formal testing results. The triangulation of different research methods may also be more persuasive to the management team, who may be unfamiliar with the format and scales associated with formal usability tests.

Test Analysis Biases

The evaluator effect refers to the finding that different individuals vary in their analysis of the same usability session. Jacobsen, Hertzum, and John’s (1998) study showed that when different usability experts viewed the same videotaped session, only 20% of problems were identified by all of the experts and 46% of problems were unique, identified by a single evaluator. Poor overlap was attributed to evaluation procedure, problem criteria, and vagueness in goal analyses.
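A team can measure its own evaluator effect with the same two figures: the share of problems reported by every evaluator and the share reported by only one. The Python sketch below is one way to compute them. It assumes each evaluator's findings have already been mapped to shared problem identifiers; the function name, evaluator labels, and problem identifiers are all hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class OverlapSummary(NamedTuple):
    total_problems: int
    pct_found_by_all: float
    pct_unique: float


def overlap_summary(problems_by_evaluator: dict[str, set[str]]) -> OverlapSummary:
    """Summarize how much evaluators agree on the problems they report."""
    # Count how many evaluators reported each distinct problem.
    counts = Counter(
        problem
        for found in problems_by_evaluator.values()
        for problem in found
    )
    total = len(counts)
    if total == 0:
        raise ValueError("no problems were reported")
    n_evaluators = len(problems_by_evaluator)
    found_by_all = sum(1 for c in counts.values() if c == n_evaluators)
    unique = sum(1 for c in counts.values() if c == 1)
    return OverlapSummary(
        total_problems=total,
        pct_found_by_all=100.0 * found_by_all / total,
        pct_unique=100.0 * unique / total,
    )


if __name__ == "__main__":
    # Hypothetical findings from three evaluators watching the same session.
    findings = {
        "evaluator_a": {"login", "search", "checkout"},
        "evaluator_b": {"login", "search", "filter"},
        "evaluator_c": {"login", "cart"},
    }
    print(overlap_summary(findings))
```

In this made-up example, one of five problems (20%) is reported by all three evaluators and three of five (60%) are reported by only one. The result depends heavily on how reports are matched to shared identifiers, which is itself a judgment call.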

There are several possible ways to counteract the evaluator effect. One way would be for a single individual on a team to evaluate every video. Another would be for all individuals on the team to watch the sessions together and reach consensus on the most important issues. In the interest of time, however, dividing the sessions among team members (a divide-and-conquer approach) may be the most feasible, so the best way to counteract the evaluator effect is to focus less on controlling for individuality and more on creating unity through shared problem criteria and goal analysis. To generate consistency in the reporting of usability tests and facilitate the analysis of results, teams should first decide on the specific classification schemes and severity scales they will use.

Second, teams should document goals and questions of interest task by task in the test plan’s moderator guide and share this guide with the management team well ahead of test time. Doing so will confirm the alignment of goals for each task and orient the usability practitioner’s attention towards classifying problems in a prioritized manner. The step-by-step goal notes would be useful not only for recorded sessions, as in Jacobsen’s studies, but also for in-person, moderated sessions. Third, problems should be organized and presented in a fashion that prioritizes problems of greatest severity and consolidates similar, overlapping problems.
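The three steps above (a shared classification scheme and severity scale, task-by-task goals, and severity-ordered, consolidated reporting) can be sketched as a small data structure. The Python example below is only an illustration: the three-level severity scale, the field names, and the rule that two reports overlap when they share a task and a category are all assumptions that a team would replace with its own agreed scheme.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical severity scale; lower rank means more severe.
SEVERITY_RANK = {"critical": 0, "serious": 1, "minor": 2}


@dataclass
class Problem:
    task: str
    category: str
    severity: str
    description: str
    evaluators: set[str] = field(default_factory=set)


def consolidate(reports: list[Problem]) -> list[Problem]:
    """Merge overlapping reports and order the result by severity."""
    merged: dict[tuple[str, str], Problem] = {}
    for report in reports:
        # Assumption: reports with the same task and category overlap.
        key = (report.task, report.category)
        if key not in merged:
            merged[key] = Problem(
                report.task,
                report.category,
                report.severity,
                report.description,
                set(report.evaluators),
            )
            continue
        existing = merged[key]
        existing.evaluators |= report.evaluators
        # Keep the most severe rating given by any evaluator.
        if SEVERITY_RANK[report.severity] < SEVERITY_RANK[existing.severity]:
            existing.severity = report.severity
    # Most severe first; ties broken by how many evaluators agreed.
    return sorted(
        merged.values(),
        key=lambda p: (SEVERITY_RANK[p.severity], -len(p.evaluators)),
    )
```

Keeping the set of evaluators on each consolidated problem preserves a record of agreement, which can help when the prioritized list is presented to management.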

Qualitative and Quantitative Testing

Wilson (2006) argues that triangulation should balance qualitative and quantitative measures in usability testing. Formative “think aloud” testing is an important gauge of the user’s experience, with its emphasis on diagnosing and correcting major problems. Quantitative data, however, offers an empirical grounding that is valuable, especially when it can aid in the identification and prioritization of usability problems.

Quantitative measures, such as the System Usability Scale (SUS) administered as a post-test questionnaire, also enable clear, measured evaluations of overall satisfaction that can be quickly compared across different users and organizations. Additionally, quantitative measures may be more intelligible to developers and management with an analytical mindset. Overall, quantitative measures should be used in concert with qualitative “think aloud” testing, particularly to bolster confidence in the validity of usability results and persuade the developers to take remedial action.
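Part of the appeal of the SUS is that its scoring is mechanical. Under the standard scoring scheme, the ten items are rated from 1 to 5; odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5 to give a score from 0 to 100. The Python sketch below applies that rule to one participant's answers. The function name and the sample responses are hypothetical.

```python
def sus_score(responses: list[int]) -> float:
    """Compute a System Usability Scale score (0 to 100) from ten
    ratings on a 1-to-5 scale, given in questionnaire order."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd items are positively worded: higher rating is better.
            total += rating - 1
        else:
            # Even items are negatively worded: lower rating is better.
            total += 5 - rating
    return total * 2.5


if __name__ == "__main__":
    # Hypothetical answers from one participant.
    print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 3]))  # prints 77.5
```

Because every team that follows this rule computes the score identically, SUS results are easier to compare across users and organizations than free-form qualitative findings.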

Conclusion

In conclusion, because multiple confounding factors and inconsistencies diminish the reliability of usability evaluations, less emphasis should be placed on reliability and more on the efficiency of usability evaluations: the ability to develop value with limited resources. The following takeaways are recommended to improve the reliability and, more importantly, the efficiency of usability evaluations:

Practitioner's Takeaways

Number of Users: Start with 5, but add more users in cases of great interface complexity or high risk (for high-risk products, consult the applicable regulations).
Communication with Management: Work closely with the management and development teams to develop tasks and a strong moderator’s guide that addresses goals task by task. Address any discrepancies or questions far in advance.
Team Collaboration: Work with the team to practice and prepare for evaluations, with emphasis on modeling the observation and moderation style of the team’s strongest members.
Test Content: Concentrate and iterate on the few most important features of an interface, and validate tasks with management.
Methodology: Triangulate test methodology with interviews, surveys, scales, etc. when doing so will enhance the management and development teams’ understanding.
Test Analysis: Prior to conducting a session, create a moderator’s guide with detailed step-by-step notes. Prioritize problems by severity and consolidate similar problems.
Qualitative and Quantitative Measures: Incorporate a mixture of qualitative and quantitative measures to boost validity and persuade developers to take action.

Works Cited

Francik, E. (2017). Five, ten, or twenty-five – How many test participants? | Newsletter from Human Factors International. HFI. Retrieved 15 March 2017, from http://www.humanfactors.com/newsletters/how_many_test_participants.asp

Jacobsen, N. E., Hertzum, M., & John, B. E. (1998, April). The evaluator effect in usability tests. In CHI 98 Conference Summary on Human Factors in Computing Systems (pp. 255-256). ACM.

Lindgaard, G., & Chattratichart, J. (2007, April). Usability testing: what have we overlooked? In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 1415-1424). ACM.

Molich, R., & Dumas, J. S. (2008). Comparative usability evaluation (CUE-4). Behaviour & Information Technology, 27(3), 263-281.

Spool, J., & Schroeder, W. (2001, March). Testing web sites: Five users is nowhere near enough. In CHI'01 extended abstracts on Human factors in computing systems (pp. 285-286). ACM.

Wilson, C. E. (2006). Triangulation: the explicit use of multiple methods, measures, and approaches for determining core issues in product development. interactions, 13(6), 46-ff.