
Kees van der Wagt
Research Director
Based in the Rotterdam office
+31 10 282 3535

MaxDiff (Best/Worst) Scaling papers

Below is an index of available MaxDiff (Best/Worst) Scaling papers. Introductory articles are listed first.

MaxDiff: Improved Measures of Importance & Preference for Segmentation (2003)

Maximum Difference (MaxDiff, or best/worst) scaling is a relatively new technique for measuring the importance or preference of multiple items. In MaxDiff tasks, respondents see sets of items (typically 4 to 6). In each set, respondents indicate which item is most important (preferred) and least important (preferred). Steve Cohen describes the methodology and presents results for a methodological study comparing MaxDiff measurement with monadic ratings and paired comparisons, and also a case study focusing on using MaxDiff for segmentation work. MaxDiff is shown to provide results that have greater between-item and between-respondent discrimination, and greater predictive accuracy than either monadic ratings or paired comparisons. Steve won the "best presentation" award with this paper at the 2003 Sawtooth Software Conference.

pdf download paper

MaxDiff/Web Technical Paper (2007)
This paper describes the technical procedures used in the MaxDiff/Web System. MaxDiff (best-worst) scaling is a trade-off method for measuring the importance or preference for multiple items, such as brands, product features, political platforms, advertising claims, etc. Any time you are considering using a rating scale, ranking scale, or constant sum scale for multiple items, you can consider using MaxDiff.

The MaxDiff methodology, originally invented by researcher and academic Jordan Louviere, has gained in popularity over the last five years. Papers on MaxDiff have won "best presentation" awards at recent ESOMAR and Sawtooth Software research conferences. It has many similarities to, but is distinctly different from, conjoint methodology and is appropriate for a wider range of research opportunities.

Sawtooth Software’s MaxDiff/Web system may be used for conducting web-based, CAPI, or paper-based MaxDiff studies. The software also supports asking the "best" half of the question only (not requiring respondents to identify the "worst" item in each set). The software may also be used for Method of Paired Comparisons research. Individual-level estimation of item scores employs Sawtooth Software’s popular hierarchical Bayes (HB) engine. Results may also be exported to Sawtooth Software’s Latent Class system for segmentation analysis.

pdf download paper

The Options Pricing Model: An Application of Best-Worst Measurement (2004)
This article offers a case study demonstrating how best/worst scaling may be used for estimating the price sensitivity of automobile buyers to different car options, such as warranty, anti-lock brakes, and keyless entry. Using best/worst scaling, the author (Keith Chrzan) shows how price sensitivity curves may be developed for each car option and presented on a common scale. Chrzan contrasts the best/worst approach with conjoint analysis, and explains the benefits of using best/worst rather than conjoint analysis for this particular application. The results closely match self-reported past purchase behavior for options on the most recently purchased car.

Chrzan provides some background on best/worst, and describes how Sawtooth Software's latent class and HB software may be used during analysis. This paper was voted "best presentation" at the 2004 Sawtooth Software Conference.

pdf download paper

Adaptive Maximum Difference Scaling (2006)
The author (Orme) presents results from two studies testing a new procedure called Adaptive MaxDiff Scaling. Rather than focus equal attention on estimating respondents' preferences (or importances) for best AND worst items, A-MaxDiff focuses attention on estimating best/most important items with greater precision. The interview adapts to each respondent, learning from prior responses. Items marked "worst" are discarded from further consideration. The questionnaire proceeds in stages. In the first stage, K items are shown per set. In each subsequent stage, K-1 items are shown per set, until the respondent is doing paired comparisons among the surviving (most preferred) items. Later tasks reflect increased utility balance.

The results show better hit rates for "best" items in holdouts relative to standard MaxDiff. Average population parameters are essentially identical between standard and adaptive forms of MaxDiff. Respondents take slightly less time to complete the adaptive survey, and they perceive it to be more enjoyable and less monotonous than standard MaxDiff. Orme argues that A-MaxDiff should be especially preferred when simulation methods such as TURF are used with MaxDiff data. The main drawback is decreased precision of estimates for "worst" items.
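The staged procedure described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the algorithm from the paper: the function name `adaptive_maxdiff`, the random assignment of surviving items to sets, the rule that items not filling a complete set carry over, and the simulated respondent who always marks the lowest-utility item as "worst" are all hypothetical choices made for illustration.

```python
import random


def adaptive_maxdiff(items, utility, k, seed=0):
    """Simulate a staged interview: k items per set, then k-1, down to pairs.

    `utility` maps each item to a hypothetical true preference score; the
    simulated respondent always marks the lowest-utility item in a set as
    "worst". Returns (survivors, stages), where each stage is recorded as
    (set_size, number_of_sets, items_surviving_the_stage).
    """
    rng = random.Random(seed)
    survivors = list(items)
    stages = []
    set_size = k
    while set_size >= 2 and len(survivors) >= set_size:
        rng.shuffle(survivors)
        n_sets = len(survivors) // set_size
        sets = [survivors[i * set_size:(i + 1) * set_size] for i in range(n_sets)]
        # Assumption: items that do not fill a complete set carry over untouched.
        kept = survivors[n_sets * set_size:]
        for shown in sets:
            worst = min(shown, key=utility.get)
            # Items marked "worst" are discarded from further consideration.
            kept.extend(item for item in shown if item != worst)
        stages.append((set_size, n_sets, len(kept)))
        survivors = kept
        set_size -= 1
    return survivors, stages


# Hypothetical study: 20 items, 5 shown per set in the first stage.
items = [f"item{i}" for i in range(20)]
utility = {name: i for i, name in enumerate(items)}
survivors, stages = adaptive_maxdiff(items, utility, k=5)
for set_size, n_sets, remaining in stages:
    print(f"{n_sets} sets of {set_size} items -> {remaining} items survive")
```

With 20 items and K = 5, this sketch runs four stages and ends with paired comparisons among the most preferred items. Because only "worst" picks remove items, a respondent's top item can never be discarded, which is consistent with the focus on estimating the best items precisely.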

pdf download paper

Testing for the Optimal Number of Attributes in MaxDiff Questions (2006)

The authors investigate how the number of items per MaxDiff set affects dropout rates, survey length, positional bias, parameter equivalence, and predictive validity. Three commercial studies are analyzed, where the number of items per set varied from 3 items/set to 8 items/set. The number of items/set has the most influence on task length, with respondents taking significantly longer to complete 8 items/set rather than 3 items/set. Statistically significant differences among the parameters were found, but the authors note that the overall results would lead to similar managerial decisions. The predictive validity tests "hint that 3 items per question may produce slightly worse predictions than questions with more items." They conclude: "Given the slight evidence of poorer hit rates and poorer out-of-sample for 3 items per question we recommend using 4 or 5 items per question in maxdiff experiments."

pdf download paper

Anchored Scaling in MaxDiff Using Dual-Response (2009)
Traditional MaxDiff analysis leads to relative importance/preference scores. However, respondents have no way to express that, for example, all of the items are important or that none of them are. Some researchers have worried that the relative nature of the MaxDiff judgments and resulting scale means that meaningful differences between respondents or segments of respondents are lost.

This article describes a practical way, proposed by Jordan Louviere (inventor of MaxDiff), to anchor the scale for each respondent based on an important/not important threshold. A dual-response questioning device is straightforward to include in MaxDiff questionnaires, and in analysis. The author (Orme) provides empirical evidence that the dual-response approach leads to meaningful discrimination among respondents and items, beyond the information provided by the standard MaxDiff tradeoffs. The pros and cons of the approach are discussed.

pdf download paper

Anchoring MaxDiff Against a Threshold - Dual Response and Direct Binary Responses
Maximum Difference Scaling is widely used to measure the relative values of items/attributes. Despite the strengths of MaxDiff, some analysts would prefer data that represent more than just relative scores; they would prefer absolute scores scaled with respect to each respondent’s importance threshold. In this article, Kevin Lattery (Maritz Research) tested two methods for anchoring MaxDiff scores to a threshold: Dual-Response MaxDiff, suggested by Louviere, and a more direct method that asks respondents to choose which attributes are above a threshold (using 2-point scale grid questions).

Using synthetic respondent data, he determined that in theory the direct method would be superior, especially as the number of attributes shown in a MaxDiff task increases. With six or more attributes shown per screen, the indirect dual-response method should not be used, and even five attributes per screen may not capture individual anchoring well. In a comparison of the two methods with human respondents, showing only four attributes per screen, results were very similar: the rank order of utilities at the respondent level was nearly identical. However, the anchoring in the direct method was more biased by the context of the total set of attributes. If a more neutral anchor for utilities is important, the indirect dual-response method may therefore be slightly better, assuming four (and certainly no more than five) attributes are shown per screen.

(Originally published in the 2010 Sawtooth Software Proceedings.)

pdf download paper

MaxDiff Experiment Designer (2006)

This document describes a software package from Sawtooth Software for designing MaxDiff (best/worst) experiments. It also describes the dummy coding procedure for estimating effects from MaxDiff data, as may be applied within Sawtooth Software’s Latent Class or HB software systems.
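As a rough illustration of what dummy coding a MaxDiff set can look like, the sketch below codes one set as two choice tasks: a "best" task with +1 entries and a "worst" task with -1 entries, using the last item as the all-zero reference level. This layout is an assumption made for illustration and may differ in detail from the procedure in the document; the function name `code_maxdiff_set` and the zero-based item indices are hypothetical.

```python
def code_maxdiff_set(shown, best, worst, n_items):
    """Dummy-code one MaxDiff set as a "best" task and a "worst" task.

    Items are integers 0..n_items-1. The last item is the reference level,
    so every design row has n_items - 1 columns. Each task is a list of
    (design_row, chosen_flag) pairs, one pair per item shown.
    """
    if best not in shown or worst not in shown or best == worst:
        raise ValueError("best and worst must be two different shown items")

    def row(item, sign):
        coded = [0] * (n_items - 1)
        if item < n_items - 1:  # the reference item stays all zeros
            coded[item] = sign
        return coded

    best_task = [(row(item, 1), int(item == best)) for item in shown]
    worst_task = [(row(item, -1), int(item == worst)) for item in shown]
    return best_task, worst_task


# Hypothetical set from a 4-item study: items 0, 2 and 3 are shown,
# item 2 is chosen as best and item 3 (the reference item) as worst.
best_task, worst_task = code_maxdiff_set([0, 2, 3], best=2, worst=3, n_items=4)
for design_row, chosen in best_task + worst_task:
    print(design_row, chosen)
```

Flipping the sign in the "worst" task lets a single set of item parameters explain both choices: an item with a high score is likely to be picked as best and unlikely to be picked as worst.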

pdf download paper

Accuracy of HB Estimation in MaxDiff Experiments (2005)
This paper communicates results of a Monte Carlo simulation study on how the precision of estimates for MaxDiff (best/worst) experiments is affected by:
  • Number of items presented per set,
  • Number of sets presented to each respondent,
  • Number of items in the overall study.

Results show that it may not be useful to show more than about 5 items per set. The data also suggest that displaying each item 3 or more times per respondent works well for obtaining reasonably precise individual-level estimates with HB. Asking more tasks, so that the number of exposures per item increases well beyond 3, seems to offer significant benefit, provided respondents do not become fatigued and provide data of reduced quality.
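The three design factors above are linked by simple arithmetic, which the following sketch makes explicit. It assumes a balanced design in which every item is shown equally often to each respondent; the function names and the example figures are illustrative and are not taken from the paper.

```python
def exposures_per_item(n_items, items_per_set, n_sets):
    """Average number of times each item is shown to one respondent."""
    return n_sets * items_per_set / n_items


def sets_needed(n_items, items_per_set, target_exposures):
    """Smallest number of sets that shows each item at least
    `target_exposures` times on average."""
    total_slots = target_exposures * n_items
    return -(-total_slots // items_per_set)  # ceiling division on integers


# Hypothetical study: 20 items, 5 items per set, 3 exposures per item.
n_sets = sets_needed(n_items=20, items_per_set=5, target_exposures=3)
print(n_sets, exposures_per_item(20, 5, n_sets))
```

For a hypothetical study of 20 items shown 5 at a time, reaching 3 exposures per item takes 12 sets per respondent, which shows how quickly questionnaire length grows with the number of items in the study.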

pdf download paper

MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB (2009)
This paper compares three methods of obtaining individual-level scores for MaxDiff surveys: simple counting, individual-level logit, and HB. Key to the success of all these methods was having enough information available for each respondent to estimate stable scores.

The author (Orme) finds that counting analysis provides reasonable population estimates of scores, but that the individual-level scores can lack precision. Precision is better under the logit-based estimation methods: either individual-level logit or HB, which "borrows" information across the sample to improve each respondent's scores.

Despite its simplicity and its weaknesses, the counting approach tends to do quite well in predicting responses to holdout choices. However, across-respondent variance (heterogeneity) tends to be lower than under the other methods studied.
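For readers unfamiliar with counting analysis, the sketch below shows one common formulation for a single respondent: the number of times an item was chosen as best, minus the number of times it was chosen as worst, divided by the number of times it was shown. This formula is an assumption for illustration and is not necessarily the exact variant used in the paper; the function name `counting_scores` and the example responses are hypothetical.

```python
from collections import Counter


def counting_scores(tasks):
    """Compute simple counting scores for one respondent.

    `tasks` is a list of (items_shown, best_item, worst_item) tuples.
    Scores range from -1 (always chosen as worst) to +1 (always best).
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items_shown, best_item, worst_item in tasks:
        shown.update(items_shown)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical responses: three sets of four items drawn from six items.
tasks = [
    (["A", "B", "C", "D"], "A", "D"),
    (["A", "B", "E", "F"], "A", "F"),
    (["C", "D", "E", "F"], "C", "D"),
]
print(counting_scores(tasks))
```

With only two exposures per item, each score can take just a handful of distinct values, which gives a feel for why individual-level counting scores can lack precision.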

pdf download paper