Symposium
Symposium 5: The Reliability Paradox: Current Issues, Partial Solutions, and Future Directions
Saturday, November 6, 1:30 - 3:30 PM CT
Chairs: Brandon Turner (The Ohio State University) and Mark Pitt (The Ohio State University)
Recent evidence suggests that many robust group-level behavioral effects derived from attention, self-regulation/impulsivity, learning, and implicit bias paradigms are poorly suited for making reliable person-level inferences (termed the “reliability paradox”). The same basic pattern is also observed in neuroimaging contexts, wherein contrasts derived from task-based functional MRI show poor test-retest reliability despite many participants showing an effect. Often attributed to the measurement properties of the task itself, such findings have led researchers to conclude that behavioral and neural measures should not be used to study trait-like individual differences. However, (mis)measurement is determined by multiple factors beyond the task, including data pre-processing and the statistical methods used to draw inferences from the data. In this symposium, we will discuss issues with traditional approaches to statistical inference in the social, behavioral, and brain sciences, highlighting how generative models can mitigate these issues.
Learning from the Reliability Paradox: How Theoretically Informed Generative Models Can Improve Person-Level Inference
Presenter: Nathaniel Haines (The Ohio State University)
Theories of individual differences are foundational to the social, behavioral, and brain sciences, yet they are often developed and tested using superficial summaries of data (e.g., mean response times) that are both (1) disconnected from our rich conceptual theories and (2) contaminated with measurement error. Traditional statistical approaches lack the flexibility required to test increasingly complex theories of behavior. To resolve this theory-description gap, we present generative modeling approaches, which use theoretical knowledge to formally specify how behavior is generated within individuals and, in turn, how generative mechanisms vary across individuals. We demonstrate the utility of generative models in the context of the “reliability paradox”, a phenomenon wherein highly replicable group effects fail to capture individual differences (e.g., low test-retest reliability). Simulations and empirical data from the Implicit Association Test and from Stroop, Flanker, Posner, and Delay Discounting tasks show that generative models yield (1) more theoretically informative parameters and (2) higher test-retest reliability estimates relative to traditional approaches, illustrating their potential for enhancing theory development.
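As an illustrative sketch (my notation, not necessarily the model used in the talk), a hierarchical generative model for trial-level responses y_{ijt} from person i in session j on trial t might be specified as

y_{ijt} \sim \mathrm{Normal}(\beta_{ij}, \sigma), \qquad
(\beta_{i1}, \beta_{i2})^{\top} \sim \mathrm{MVNormal}\!\left( (\mu_1, \mu_2)^{\top},\;
\begin{pmatrix} \tau_1^2 & \rho\,\tau_1\tau_2 \\ \rho\,\tau_1\tau_2 & \tau_2^2 \end{pmatrix} \right),

where the test-retest correlation \rho between the person-level parameters is estimated directly within the model rather than computed afterward from error-contaminated point estimates, which is why it is not attenuated by trial-level noise.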
Is Measuring Inhibition Doomed To Fail?
Presenter: Jeffrey Rouder (University of California, Irvine) Establishing correlations among common inhibition tasks such as Stroop or flanker tasks has been proven quite difficult despite many
attempts. It remains unknown whether this difficulty occurs because inhibition is a disparate set of phenomena or whether the analytic techniques to uncover a unified inhibition phenomenon fail in real-world
contexts with small effects. We explore the field-wide inability to assess whether inhibition is unified or disparate. We do so by showing that ordinary methods of correlating performance including those
with latent variable models are doomed to fail because of trial noise. We then develop hierarchical models that account for variation across trials, variation across individuals, and covariation across
individuals and tasks. These hierarchical models also fail to uncover correlations in typical designs for the same reasons. While we can characterize the degree of trial noise, we cannot recover correlations
in typical designs with typically small effect, even in those that enroll hundreds of people. We think there are no magical models or statistical analyses for the problem---effects in Stroop or flanker
are too small for individual-difference analysis.
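To make the trial-noise argument concrete, here is a minimal simulation sketch (not the presenter's code; the effect sizes and noise levels are illustrative assumptions) showing how trial noise attenuates the observed correlation between two tasks even when the latent correlation is substantial:

import numpy as np

# Minimal sketch: trial noise attenuates the observed correlation between two
# small inhibition effects. All numbers below are illustrative assumptions,
# not values reported in the talk.
rng = np.random.default_rng(1)
n_subj, n_trials = 300, 100
rho_true = 0.7                      # latent correlation between the two task effects
sd_effect, sd_trial = 15.0, 200.0   # ms: person-level spread vs. trial-level noise

# correlated true effects for the two tasks
cov = [[sd_effect**2, rho_true * sd_effect**2],
       [rho_true * sd_effect**2, sd_effect**2]]
effects = rng.multivariate_normal([60.0, 60.0], cov, size=n_subj)

# observed per-person effects are trial averages contaminated by trial noise
obs = effects + rng.normal(0.0, sd_trial / np.sqrt(n_trials), size=(n_subj, 2))

reliability = sd_effect**2 / (sd_effect**2 + sd_trial**2 / n_trials)
print("reliability of each task score:", round(reliability, 2))            # ~0.36
print("expected observed correlation:", round(rho_true * reliability, 2))  # ~0.25
print("simulated observed correlation:", round(float(np.corrcoef(obs[:, 0], obs[:, 1])[0, 1]), 2))

Under these assumed values, a latent correlation of 0.7 shows up as an observed correlation of roughly 0.25, even with 300 participants and 100 trials per task.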
Using Hierarchical Generative Models To Improve Reliability in Computational Psychiatry
Presenter: Vanessa Brown (University of Pittsburgh)
Generative models of cognitive processes may enable more precise measurement of how these processes are disrupted in psychopathology. However, psychometric investigations into generative models popular in computational psychiatry, such as reinforcement learning models, have yielded inconsistent results. Using data from participants with elevated compulsivity (N=38) who completed a reinforcement learning task hypothesized to measure compulsivity-related learning disruptions (the Daw two-step task) at two timepoints one week apart, we examined 1) whether acceptable reliability could be achieved; 2) which analysis methods led to the best reliability; and 3) whether methods that increased within-person consistency within and across visits (i.e., reliability) were also sensitive to between-person differences in performance. We found that measures from hierarchical Bayesian models, when used appropriately, led to good split-half and test-retest reliability while increasing power to detect individual differences. These methods generalized to other datasets with other reinforcement learning tasks and populations, suggesting that hierarchical generative models are a powerful tool for reliably and sensitively measuring clinically relevant behavior.
When Is the Conventional Intraclass Correlation Not Suited for Measuring Test-Retest Reliability?
Presenter: Gang Chen (National Institutes of Health)
Test-retest reliability (TRR) measures the consistency of an effect across time, providing a critical criterion for studies of individual differences. Evidence of poor intraclass correlation (ICC) has been mounting and has recently attracted considerable attention: ICC estimates appear unacceptably low for neuroimaging and psychometric data, casting doubt on their usability in studies of individual differences. In the current investigation, we demonstrate the limitations of the conventional ICC and show that 1) conventional ICCs can substantially underestimate TRR; 2) a single ICC value is misleading because it fails to capture estimation uncertainty; 3) the extent of ICC underestimation depends on the number of trials and on cross-trial variability; and 4) subject sample size has surprisingly little impact on the ICC. In addition, we adopt a hierarchical modeling framework that a) more accurately characterizes and accounts for cross-trial variability, b) resolves the underestimation of TRR by the ICC, and c) illustrates the crucial role of trial sample size in improving TRR precision. We explore the puzzling observation of large cross-trial variability relative to cross-subject variability and its implications for experimental design.
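As a back-of-the-envelope sketch of the trial-count dependence described above (the simple variance decomposition and the numbers below are my own illustrative assumptions, not results from the talk), the expected ICC of trial-averaged scores is roughly sigma_subject^2 / (sigma_subject^2 + sigma_trial^2 / n_trials):

# Illustrative sketch: the conventional ICC computed from trial-averaged scores
# shrinks as cross-trial variability grows relative to cross-subject variability,
# and recovers only slowly as trials are added.
def icc_from_trial_averages(sigma_subject, sigma_trial, n_trials):
    noise = sigma_trial**2 / n_trials        # noise variance left in a trial-averaged score
    return sigma_subject**2 / (sigma_subject**2 + noise)

for n in (20, 50, 200, 1000):
    # assumed values: cross-subject SD of 10 vs. cross-trial SD of 100 (arbitrary units)
    print(n, round(icc_from_trial_averages(sigma_subject=10.0, sigma_trial=100.0, n_trials=n), 2))
# prints roughly 0.17, 0.33, 0.67, 0.91

Under these assumptions, adding trials (not subjects) is what moves the ICC, consistent with the points above about trial sample size and cross-trial variability.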