Symposium
Symposium 5: The Reliability Paradox: Current Issues, Partial Solutions, and Future Directions
Saturday, November 6, 1:30 - 3:30 PM CT
Chairs: Brandon Turner (The Ohio State University) and Mark Pitt (The Ohio State University)
Recent evidence suggests that many robust group-level behavioral effects derived from attention, self-regulation/impulsivity, learning, and implicit bias paradigms are poorly suited for making reliable person-level inferences (termed the “reliability paradox”). The same basic pattern is also observed in neuroimaging contexts, wherein contrasts derived from task-based functional MRI show poor test-retest reliability despite many participants showing an effect. Often attributed to the measurement properties of the task itself, such findings have led researchers to conclude that behavioral and neural measures should not be used to study trait-like individual differences. However, (mis)measurement is determined by multiple factors beyond the task, including data pre-processing and the statistical methods used to draw inferences from the data. In this symposium, we will discuss issues with traditional approaches to statistical inference in the social, behavioral, and brain sciences, highlighting how generative models can mitigate these issues.
Learning from the Reliability Paradox: How Theoretically Informed Generative Models Can Improve Person-Level Inference
Presenter: Nathaniel Haines (The Ohio State University)
Theories of individual differences are foundational to the social, behavioral, and brain sciences, yet they are often developed and tested using superficial summaries of data (e.g., mean response times) that are both (1) disconnected from our rich conceptual theories and (2) contaminated with measurement error. Traditional statistical approaches lack the flexibility required to test increasingly complex theories of behavior. To resolve this theory-description gap, we present generative modeling approaches, which use theoretical knowledge to formally specify how behavior is generated within individuals and, in turn, how generative mechanisms vary across individuals. We demonstrate the utility of generative models in the context of the “reliability paradox”, a phenomenon wherein highly replicable group effects fail to capture individual differences (e.g., low test-retest reliability). Simulations and empirical data from the Implicit Association Test and from Stroop, Flanker, Posner, and Delay Discounting tasks show that generative models yield (1) more theoretically informative parameters and (2) higher test-retest reliability estimates relative to traditional approaches, illustrating their potential for enhancing theory development.
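As an illustrative sketch (my notation, not necessarily the model used in the talk), a hierarchical generative model for trial-level responses y_{ijt} from person i in session j on trial t might be specified as

y_{ijt} \sim \mathrm{Normal}(\beta_{ij}, \sigma), \qquad
(\beta_{i1}, \beta_{i2})^{\top} \sim \mathrm{MVNormal}\!\left( (\mu_1, \mu_2)^{\top},\;
\begin{pmatrix} \tau_1^2 & \rho\,\tau_1\tau_2 \\ \rho\,\tau_1\tau_2 & \tau_2^2 \end{pmatrix} \right),

where the test-retest correlation \rho between the person-level parameters is estimated directly within the model rather than computed afterward from error-contaminated point estimates, which is why it is not attenuated by trial-level noise.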
Is Measuring Inhibition Doomed To Fail?
Presenter: Jeffrey Rouder (University of California, Irvine) Establishing correlations among common inhibition tasks such as Stroop or flanker tasks has been proven quite difficult despite many
attempts. It remains unknown whether this difficulty occurs because inhibition is a disparate set of phenomena or whether the analytic techniques to uncover a unified inhibition phenomenon fail in real-world
contexts with small effects. We explore the field-wide inability to assess whether inhibition is unified or disparate. We do so by showing that ordinary methods of correlating performance including those
with latent variable models are doomed to fail because of trial noise. We then develop hierarchical models that account for variation across trials, variation across individuals, and covariation across
individuals and tasks. These hierarchical models also fail to uncover correlations in typical designs for the same reasons. While we can characterize the degree of trial noise, we cannot recover correlations
in typical designs with typically small effect, even in those that enroll hundreds of people. We think there are no magical models or statistical analyses for the problem---effects in Stroop or flanker
are too small for individual-difference analysis.
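To make the trial-noise argument concrete, here is a minimal simulation sketch (not the presenter's code; the effect sizes and noise levels are illustrative assumptions) showing how trial noise attenuates the observed correlation between two tasks even when the latent correlation is substantial:

import numpy as np

# Minimal sketch: trial noise attenuates the observed correlation between two
# small inhibition effects. All numbers below are illustrative assumptions,
# not values reported in the talk.
rng = np.random.default_rng(1)
n_subj, n_trials = 300, 100
rho_true = 0.7                      # latent correlation between the two task effects
sd_effect, sd_trial = 15.0, 200.0   # ms: person-level spread vs. trial-level noise

# correlated true effects for the two tasks
cov = [[sd_effect**2, rho_true * sd_effect**2],
       [rho_true * sd_effect**2, sd_effect**2]]
effects = rng.multivariate_normal([60.0, 60.0], cov, size=n_subj)

# observed per-person effects are trial averages contaminated by trial noise
obs = effects + rng.normal(0.0, sd_trial / np.sqrt(n_trials), size=(n_subj, 2))

reliability = sd_effect**2 / (sd_effect**2 + sd_trial**2 / n_trials)
print("reliability of each task score:", round(reliability, 2))            # ~0.36
print("expected observed correlation:", round(rho_true * reliability, 2))  # ~0.25
print("simulated observed correlation:", round(float(np.corrcoef(obs[:, 0], obs[:, 1])[0, 1]), 2))

Under these assumed values, a latent correlation of 0.7 shows up as an observed correlation of roughly 0.25, even with 300 participants and 100 trials per task.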
Using Hierarchical Generative Models To Improve Reliability in Computational Psychiatry
Presenter: Vanessa Brown (University of Pittsburgh)
Generative models of cognitive processes may enable more precise measurement of how these processes are disrupted in psychopathology. However, psychometric investigations into generative models popular in computational psychiatry, such as reinforcement learning models, have yielded inconsistent results. Using data from participants with elevated compulsivity (N=38) who completed a reinforcement learning task hypothesized to measure compulsivity-related learning disruptions (the Daw two-step task) at two timepoints one week apart, we examined 1) whether acceptable reliability could be achieved; 2) which analysis methods led to the best reliability; and 3) whether methods that increased within-person consistency within and across visits (i.e., reliability) were also sensitive to between-person differences in performance. We found that measures from hierarchical Bayesian models, when used appropriately, led to good split-half and test-retest reliability while increasing power to detect individual differences. These methods generalized to other datasets with other reinforcement learning tasks and populations, suggesting that hierarchical generative models are a powerful tool for reliably and sensitively measuring clinically relevant behavior.
When Is the Conventional Intraclass Correlation Not Suited for Measuring Test-Retest Reliability?
Presenter: Gang Chen (National Institutes of Health)
Test-retest reliability (TRR) measures the consistency of an effect across time, providing a critical criterion for studies of individual differences. Evidence of poor intraclass correlation (ICC) has been mounting and has recently attracted considerable attention: ICC estimates appear unacceptably low for neuroimaging and psychometric data, casting doubt on their usability in studies of individual differences. In the current investigation, we demonstrate the limitations of the conventional ICC and show that 1) conventional ICCs can substantially underestimate TRR; 2) a single ICC value is misleading because it fails to capture estimation uncertainty; 3) the extent of ICC underestimation depends on the number of trials and on cross-trial variability; and 4) subject sample size has surprisingly little impact on the ICC. In addition, we adopt a hierarchical modeling framework that a) more accurately characterizes and accounts for cross-trial variability, b) resolves the underestimation of TRR by the ICC, and c) illustrates the crucial role of trial sample size in improving TRR precision. We explore the puzzling observation of large cross-trial variability relative to cross-subject variability and its implications for experimental design.
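As a back-of-the-envelope sketch of the trial-count dependence described above (the simple variance decomposition and the numbers below are my own illustrative assumptions, not results from the talk), the expected ICC of trial-averaged scores is roughly sigma_subject^2 / (sigma_subject^2 + sigma_trial^2 / n_trials):

# Illustrative sketch: the conventional ICC computed from trial-averaged scores
# shrinks as cross-trial variability grows relative to cross-subject variability,
# and recovers only slowly as trials are added.
def icc_from_trial_averages(sigma_subject, sigma_trial, n_trials):
    noise = sigma_trial**2 / n_trials        # noise variance left in a trial-averaged score
    return sigma_subject**2 / (sigma_subject**2 + noise)

for n in (20, 50, 200, 1000):
    # assumed values: cross-subject SD of 10 vs. cross-trial SD of 100 (arbitrary units)
    print(n, round(icc_from_trial_averages(sigma_subject=10.0, sigma_trial=100.0, n_trials=n), 2))
# prints roughly 0.17, 0.33, 0.67, 0.91

Under these assumptions, adding trials (not subjects) is what moves the ICC, consistent with the points above about trial sample size and cross-trial variability.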