Evidence accumulation or reinforcement learning? Modeling sequential decision-making in the “observe or bet” task
How do we decide whether to explore or exploit in uncertain environments where feedback is intermittent? In this talk, we compare two approaches to computational modeling of the cognitive processes underlying such decisions, using control-group data from an ongoing clinical research collaboration. Participants completed multiple blocks of the “observe or bet” task, a dynamic sequential decision-making task. To maximize reward, participants must strike a balance between betting on (but not seeing) which event will occur and observing events in the sequence (thereby forgoing the chance to gain or lose points). Participants alternated efficiently between observing and betting, observing more at the start of a sequence and betting more toward the end. To better understand these data, we used two classes of hierarchical Bayesian models. First, we implemented nine versions of the “heuristic model” of this task developed by Navarro, Newell, & Schulze (2016), which posits a cross-trial evidence accumulation process. Second, we implemented eight variants of a modified reinforcement learning (RL) model, a novel adaptation of Q-learning. Across all models, the modified RL model with counterfactual learning and a high fixed value of observing provided the best fit to the observed data. We discuss implications for modeling this task, and for RL modeling more generally. In particular, the modified RL model’s success challenges a strict conceptualization of RL: it suggests that the same computations responsible for learning from rewards may also subserve learning from outcomes that are not extrinsically (but are potentially intrinsically) rewarding.
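To make the model class concrete, the following is a minimal sketch of a Q-learning agent adapted for the observe-or-bet task, not the exact implementation from the talk. The two-event structure, the softmax choice rule, and the parameter names (`alpha`, `beta`, `v_obs`) are assumptions of this illustration; the key ingredients from the abstract are the fixed value of observing and the counterfactual update, in which an observed outcome updates the values of both possible bets even though no reward is delivered.

```python
import math
import random

def simulate_observe_or_bet(outcomes, alpha=0.2, beta=3.0, v_obs=0.6, seed=0):
    """Hypothetical modified Q-learning agent for the observe-or-bet task.

    outcomes : sequence of 0/1 events (which of two events occurs each trial)
    alpha    : learning rate (illustrative value)
    beta     : softmax inverse temperature (illustrative value)
    v_obs    : fixed value assigned to observing (an assumption of this sketch)
    """
    rng = random.Random(seed)
    q = [0.5, 0.5]  # learned value of betting on event 0 / event 1
    choices = []
    for outcome in outcomes:
        # Value of betting = value of the currently better bet.
        v_bet = max(q)
        # Softmax choice between betting and observing (fixed value).
        p_bet = 1.0 / (1.0 + math.exp(-beta * (v_bet - v_obs)))
        if rng.random() < p_bet:
            # Bet on the higher-valued event; the outcome is NOT shown,
            # so no learning occurs on bet trials.
            choices.append("bet0" if q[0] >= q[1] else "bet1")
        else:
            # Observe: the outcome is shown but no points are won or lost.
            # Counterfactual learning: update BOTH options toward the
            # implied reward each bet would have earned.
            choices.append("observe")
            for a in (0, 1):
                r = 1.0 if a == outcome else 0.0
                q[a] += alpha * (r - q[a])
    return choices, q
```

Under this sketch, a high `v_obs` produces the qualitative pattern described above: early on, when both bets are near their uncertain starting values, observing dominates; once one event's value pulls clearly ahead, betting becomes more attractive.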
Hi Beth, very interesting talk and cool modelling. I understand your results to mean that the RL model not only does a better job in terms of relative model performance (WAIC), but also in terms of absolute model performance (i.e., it provides the overall best fit). Is this the case across all the posterior predictive summary statistics you have looke...