When EZ does it: Simpler models are sometimes better than their complex cousins
Wednesday, November 9, 2016
Posted by: Stephan Lewandowsky
You are in the cognitive laboratory to participate in an experiment. A tight cluster of 300 lines at various orientations is projected onto the screen in front of you. Are they predominantly tilted to the left or to the right? The experimenter has instructed you to respond as quickly as possible by pressing one of two keys: “z” for “left” and “/” for “right.” The figure below shows a typical stimulus:
There are many such trials, and in addition to being speedy, you are also asked to be as accurate as possible. Because the orientations of individual lines within each stimulus cluster are drawn from a distribution with considerable variance, the task is quite difficult. This procedure is representative of a “choice reaction time” task, which is one of the staple tasks of researchers in decision making.
Although the task appears simple, the data from such experiments are strikingly rich and can provide a broad window into human cognition. After all, there are not only two classes of responses (correct and incorrect), but each response class is also characterized by an entire distribution of response times across the numerous trials of each type.
A complete account of human performance in this quintessential decision-making task would thus describe response accuracy and latency, and the relationship between the two, as a function of various experimental manipulations.
We now have access to a wide range of sophisticated computational models that can describe performance in such tasks. Common to nearly all of them is the assumption that when a stimulus is presented, not all information is available to the decision maker instantaneously. Instead, people gradually build up the evidence required to make a decision, which is why those models are called “sequential-sampling” models.
For illustration, let’s assume that people sample evidence in discrete time steps, where each sample represents a number that follows a specific distribution. Let’s furthermore assume that people sum this sampled evidence across time steps to reach a decision. On the assumption that evidence for one decision (e.g., lines slant left) has a positive sign whereas evidence for the other decision (lines slant right) has a negative sign, we can graphically display the decision process as a “drift” towards the top or the bottom of a two-dimensional space with time on the X axis and evidence on the Y axis. The figure below shows a few hypothetical evidence-accumulation trials from a speeded-choice experiment for a sequential-sampling model.
To complete our understanding of the model, consider the horizontal dashed lines at the top and bottom: they represent response criteria for the two responses (“left” on top and “right” on the bottom), and whenever the accumulated evidence crosses one or the other criterion, a response is made with a latency corresponding to the point along the X axis at which the boundary is crossed.
The figure above represents a situation in which the stimulus conveys information that supports a “left” decision; hence, 4 of the 5 trials converge onto the (correct) top boundary. However, the model (like people) can produce occasional errors, as shown by the purple line that crosses the bottom boundary. This can happen because each sample of evidence is inherently noisy, and sometimes the noise can get amplified across samples and an error occurs.
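For readers who like to tinker, the accumulation process just described can be sketched in a few lines of Python. This is a minimal illustration and not the code behind the figure: the function name `simulate_trial`, the choice of a Gaussian distribution for each sample, and the values for the drift, the criterion, and the noise are all assumptions made for the example.

```python
import random


def simulate_trial(drift, criterion, noise_sd=1.0, max_steps=10_000, rng=random):
    """Sum noisy evidence samples until the total crosses +criterion or -criterion.

    Returns (response, n_steps). The response is "left" for the upper boundary,
    "right" for the lower boundary, or None if no boundary was reached in time.
    """
    evidence = 0.0
    for step in range(1, max_steps + 1):
        # Each sample is drawn from a Gaussian centered on the drift (assumed).
        evidence += rng.gauss(drift, noise_sd)
        if evidence >= criterion:
            return "left", step
        if evidence <= -criterion:
            return "right", step
    return None, max_steps


# A stimulus that favors "left": positive drift, criteria at +/-3 (illustrative values).
rng = random.Random(2016)
trials = [simulate_trial(drift=0.2, criterion=3.0, rng=rng) for _ in range(1000)]

n_left = sum(1 for response, _ in trials if response == "left")
mean_steps = sum(steps for _, steps in trials) / len(trials)
print(f"proportion 'left': {n_left / len(trials):.2f}")
print(f"mean latency: {mean_steps:.1f} time steps")
```

Because the drift is positive, most simulated trials end at the upper boundary, but the noise in each sample ensures that some end at the lower one, just like the purple line.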
Sequential-sampling models have numerous attractive features. One often-cited strength is the models’ ability to handle a speed-accuracy tradeoff—that is, the robust observation that if people are instructed to respond as quickly as possible, their accuracy suffers compared to a condition in which the emphasis is on accuracy and speed does not matter. In the figure above, instructions to favor speed over accuracy would be modeled by moving the response criteria towards the origin. To illustrate, suppose people placed the criteria at +/-1: Under those circumstances one additional response (the “red” trial) would end up being in error, but responding overall would be speeded considerably (at a rough guess, by 70% or more, but you can work that out yourself from the figure).
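The tradeoff can also be worked out analytically if we are willing to replace the discrete random walk with its continuous-time counterpart. The sketch below uses the standard results for a diffusion process that starts midway between two boundaries; it therefore only approximates the step-by-step version described above. The function name `predicted_performance` and all parameter values are assumptions for illustration and are not read off the figure.

```python
import math


def predicted_performance(drift, criterion, noise_sd=1.0):
    """Predicted accuracy and mean decision time for boundaries at +/-criterion.

    Assumes a continuous diffusion process with nonzero drift that starts at
    zero, halfway between the two boundaries, and no non-decision time.
    """
    k = drift * criterion / noise_sd**2
    p_correct = 1.0 / (1.0 + math.exp(-2.0 * k))
    mean_time = (criterion / drift) * math.tanh(k)
    return p_correct, mean_time


# Same stimulus quality, two instructions: cautious (+/-3) versus speedy (+/-1).
for label, criterion in [("accuracy emphasis", 3.0), ("speed emphasis", 1.0)]:
    p, t = predicted_performance(drift=0.2, criterion=criterion)
    print(f"{label}: P(correct) = {p:.2f}, mean decision time = {t:.2f}")
```

Moving the criteria towards the origin lowers both numbers at once: responses come sooner, and more of them are wrong.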
The ability to handle tradeoffs between speed and accuracy feeds into the second strength of the model, which is the estimation of parameters that permit comparison of performance between individuals or conditions that is not affected by speed-accuracy tradeoffs. Suppose you measure performance in the above task for two people, John and Jane. John gets 80% of all trials correct and takes 800 ms on average, whereas Jane gets 90% correct but takes 900 ms on average. Who is the better performer?
This question is difficult to answer by conventional statistical means. However, it can be resolved by fitting a sequential-sampling model to the data and estimating the “drift rate” and the various boundary parameters for each subject. The drift rate describes the statistical properties of the sampling process (the greater, or more positive, the drift rate, the faster responses will converge on the upper boundary), and it is considered to reflect all factors that affect the speed of evidence extraction. This includes features of the stimulus, such as the number of lines in the above stimulus that slant to the left, as well as people’s abilities to process stimuli. For example, drift rate has been found to be associated with individual-differences variables such as working memory and IQ.
So how are drift rates estimated from the data of individual subjects? This is a nuanced issue and the answer depends on the particular model that is being fit.
A recent article in the Psychonomic Bulletin & Review compared the ability of two different types of sequential-sampling models to estimate drift rates of individual participants and to compare them between conditions. Researchers Don van Ravenzwaaij, Chris Donkin, and Joachim Vandekerckhove compared a highly complex model, known as the diffusion model, to a stylized version of the same model, known as the EZ-diffusion model. The main difference between the models was the number of parameters: the full diffusion model postulated trial-to-trial variability of three processes, including the drift rate itself. Those variability parameters were absent in the EZ version.
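To give a flavor of how lean the EZ version is: its three parameters can be computed directly from three summary statistics, with no iterative fitting at all. The sketch below follows the closed-form equations as they are commonly stated for the EZ-diffusion model (Wagenmakers and colleagues, 2007); please check them against that paper before relying on them. The function name `ez_diffusion`, the scaling constant `s = 0.1`, and the input values in the usage lines are assumptions for illustration and are not data from the article discussed here.

```python
import math


def ez_diffusion(p_correct, rt_variance, rt_mean, s=0.1):
    """Closed-form EZ estimates of drift rate, boundary separation, and
    non-decision time.

    p_correct   proportion of correct responses (not exactly 0, 0.5, or 1)
    rt_variance variance of correct response times, in seconds squared
    rt_mean     mean of correct response times, in seconds
    s           scaling parameter for within-trial noise (0.1 by convention)
    """
    if p_correct in (0.0, 0.5, 1.0):
        raise ValueError("p_correct of 0, 0.5, or 1 needs an edge correction first")
    logit = math.log(p_correct / (1.0 - p_correct))
    x = logit * (logit * p_correct**2 - logit * p_correct + p_correct - 0.5) / rt_variance
    drift = math.copysign(1.0, p_correct - 0.5) * s * x**0.25
    boundary = s**2 * logit / drift
    y = -drift * boundary / s**2
    mean_decision_time = (boundary / (2.0 * drift)) * (1.0 - math.exp(y)) / (1.0 + math.exp(y))
    non_decision_time = rt_mean - mean_decision_time
    return drift, boundary, non_decision_time


# Illustrative summary statistics for one hypothetical participant.
drift, boundary, non_decision = ez_diffusion(p_correct=0.802, rt_variance=0.112, rt_mean=0.723)
print(f"drift = {drift:.3f}, boundary = {boundary:.3f}, non-decision time = {non_decision:.3f} s")
```

Note that accuracy and mean response time alone are not enough: the calculation also needs the variance of the response times.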
The reasons for the inclusion of trial-to-trial variability in the full diffusion model can be ignored for now: what is relevant is the underlying principle of how the number of parameters relates to the ability of a model to fit data. All other variables being equal, a larger number of parameters means that a model can handle a given data set better—i.e., can describe it in detail—than a competing model with fewer parameters.
But here is the problem: the more complex model may be so powerful that it also fits the noise in the data. This is illustrated in the figure below using an example from working-memory research. In both panels, the data (plotting symbols) show the proportion of correct recall as a function of “speech rate”; that is, the time it takes to pronounce the words presented for memorization. The figure shows one of the most reliable findings in working-memory research, namely that shorter words—those that can be pronounced at a faster rate—are remembered better than longer words.
What is of interest here are the solid lines that are fit to a subset of the data (the 3 dark dots). On the left, we fit a conventional linear regression with only two parameters—a slope and an intercept. On the right, we fit a fourth-order polynomial to the same 3 data points. If one only considers the fit to the 3 data points, the more complex model on the right, with its additional parameters, fits the data better than the simple linear regression. In fact, the model fits the three data points perfectly. However, despite that apparent power, the more complex model fails miserably when it comes to predicting the remaining data points (the open circles) that were not part of the sample being fitted.
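The same pattern is easy to reproduce with made-up numbers. The sketch below is not a re-analysis of the data in the figure: the values are invented, and it uses five fitted points rather than three so that a fourth-order polynomial is exactly determined. The helper names (`fit_line`, `fit_interpolating_polynomial`, `mean_squared_error`) are likewise assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line (two parameters: intercept and slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree len(xs) - 1 through every point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented data: speech rate (x) and proportion correct (y), roughly linear plus noise.
fit_x, fit_y = [1.0, 1.5, 2.0, 2.5, 3.0], [0.32, 0.48, 0.50, 0.66, 0.70]
new_x, new_y = [0.5, 3.5, 4.0], [0.22, 0.78, 0.91]

line = fit_line(fit_x, fit_y)
quartic = fit_interpolating_polynomial(fit_x, fit_y)  # five points, so degree four

for name, model in [("line", line), ("fourth-order polynomial", quartic)]:
    print(f"{name}: error on fitted points = {mean_squared_error(model, fit_x, fit_y):.4f}, "
          f"error on new points = {mean_squared_error(model, new_x, new_y):.4f}")
```

The polynomial passes through every fitted point, yet its predictions for the new points are far worse than those of the humble straight line.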
Researchers van Ravenzwaaij and colleagues examined this issue using a so-called “model-recovery” procedure. The first step in this procedure is to generate synthetic experimental data from the full diffusion model. These data have the same properties as the data from a behavioral experiment, except that there is no ambiguity about the underlying “true” effect. van Ravenzwaaij and colleagues generated data for two synthetic groups of participants whose average drift rate differed between conditions—thereby simulating stimuli with different numbers of left-slanting lines.
The second step in this procedure is to fit the full diffusion model and the EZ model to those data. The third step is to do a statistical analysis on the parameter values of each of those models, and to ask which model best recovered the known statistical structure in the data. Recall that unlike in a real experiment, in this situation we know for sure how the data were generated.
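A toy version of these three steps can be sketched as follows. It is emphatically not the authors' procedure: the generating model here includes only one of the three variability parameters (trial-to-trial variability in drift), only the EZ drift estimate is computed because fitting the full diffusion model requires numerical optimization that does not fit into a short example, and every parameter value, the time step, and all function names are assumptions chosen for illustration.

```python
import math
import random
import statistics

S = 0.1     # conventional scaling of within-trial noise
DT = 0.002  # simulation time step in seconds (assumed)


def simulate_participant(drift_mean, rng, n_trials=200, boundary=0.12,
                         non_decision=0.3, drift_sd=0.03):
    """Step 1: synthetic data from a diffusion process whose drift varies across trials."""
    correct, rts = [], []
    step_sd = S * math.sqrt(DT)
    for _ in range(n_trials):
        drift = rng.gauss(drift_mean, drift_sd)  # trial-to-trial variability
        x, t = boundary / 2.0, 0.0               # start midway between the boundaries
        while 0.0 < x < boundary:
            x += drift * DT + rng.gauss(0.0, step_sd)
            t += DT
        correct.append(x >= boundary)
        rts.append(non_decision + t)
    return correct, rts


def ez_drift(correct, rts):
    """Step 2: EZ estimate of drift from accuracy and the variance of correct RTs."""
    n = len(correct)
    pc = sum(correct) / n
    pc = min(max(pc, 1.0 / (2 * n)), 1.0 - 1.0 / (2 * n))  # simple edge correction
    if pc == 0.5:
        pc += 1.0 / (2 * n)
    vrt = statistics.variance([rt for c, rt in zip(correct, rts) if c])
    logit = math.log(pc / (1.0 - pc))
    x = logit * (logit * pc**2 - logit * pc + pc - 0.5) / vrt
    return math.copysign(1.0, pc - 0.5) * S * x**0.25


def welch_t(a, b):
    """Step 3: Welch t statistic for the difference in means (b minus a)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


rng = random.Random(1081)
# Two synthetic groups of 10 participants with known average drift rates.
low = [ez_drift(*simulate_participant(rng.gauss(0.10, 0.02), rng)) for _ in range(10)]
high = [ez_drift(*simulate_participant(rng.gauss(0.20, 0.02), rng)) for _ in range(10)]

print(f"mean EZ drift, low group:  {statistics.mean(low):.3f}")
print(f"mean EZ drift, high group: {statistics.mean(high):.3f}")
print(f"Welch t for the group difference: {welch_t(low, high):.2f}")
```

Because the true group difference is built into the simulation, the final statistic can be judged against a known answer, which is exactly what makes model recovery informative.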
At first glance, one might wonder why anyone would bother with this exercise: if the full diffusion model generated the data, surely it would recover the structure in those data better than the EZ model?
In fact, van Ravenzwaaij and colleagues observed the opposite. By and large, across a wide range of simulated conditions, the simpler EZ model, which did not contain the trial-to-trial variability that is required to capture nuances in the data, did a better job of recovering the underlying group differences.
This happened for the reasons shown in the above figure with the rather contrived example involving the fourth-order polynomial: the full diffusion model fit the data well, but the estimates for the large number of parameters were not well constrained by the data. The EZ model, in contrast, could be estimated reliably because of its simpler structure, and hence it was better able to recover the true statistical nature of the data.
Does this mean a simpler model is always preferable? van Ravenzwaaij and colleagues counsel against that interpretation: because their synthetic data were generated by the full diffusion model, the simpler EZ model necessarily missed out on features in the data and its parameter estimates were therefore biased.
But sometimes a little bit of bias is preferable to a lot of variance, and van Ravenzwaaij and colleagues show that there are circumstances in which a simpler model may do better than the “correct” more complex model in recovering the true nature of data.
Article focused on in this post:
van Ravenzwaaij, D., Donkin, C., & Vandekerckhove, J. (2016). The EZ diffusion model provides a powerful test of simple empirical effects. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-016-1081-y.