Learning
Edward Vul
Human conflict and coordination rely on our ability to reason about and predict the behavior of others. We investigate how people adapt to and exploit their opponents in repeated adversarial interactions through iterated play of Rock-Paper-Scissors (RPS).

In Experiment 1, we investigate naturalistic adversarial interactions between two humans. Participants (N=116) played 300 rounds of RPS in 58 stable dyads. We find that the distribution of win count differentials differs significantly from Nash equilibrium random play (χ^2(5)=133.27, p < 0.001), suggesting that many participants are able to exploit dependencies in their opponent’s move choices. However, expected win count differentials based on observed regularities in participant behavior reveal that people fail to maximally exploit their opponents. This raises the question of what kinds of patterned behavior people are able to detect and exploit.

In Experiment 2, participants (N=217) were paired against bots employing stable RPS strategies. We tested seven strategies that parametrically varied the number and source of their behavioral regularities. This allowed us to establish levels of complexity that people exploit maximally, partially, and not at all. For partially exploitable bots, participants come close to maximal exploitation of subparts of the bot’s strategy, with chance performance otherwise, suggesting that people are selectively sensitive to particular patterns of opponent behavior.

Our results show that the ability to exploit opponents in adaptive settings relies on successful detection of a limited set of patterns. A concrete understanding of the inputs people use to predict others provides insight into how people establish cooperative behavior, and why it sometimes fails.
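As a rough reference point for the Nash-equilibrium comparison above, the sketch below (our own illustration, not the authors' analysis code) simulates the win-count differentials expected when both players choose moves uniformly at random over 300 rounds; the number of simulated dyads and all names are hypothetical.

```python
# Hypothetical simulation of win-count differentials under Nash-equilibrium
# random play (both players choose rock/paper/scissors uniformly at random).
import numpy as np

rng = np.random.default_rng(0)
N_ROUNDS, N_DYADS = 300, 10_000   # 300 rounds per dyad, as in Experiment 1

def simulate_dyad():
    """Return player 1's wins minus player 2's wins for one simulated dyad."""
    p1 = rng.integers(0, 3, N_ROUNDS)             # 0=rock, 1=paper, 2=scissors
    p2 = rng.integers(0, 3, N_ROUNDS)
    p1_wins = np.sum((p1 - p2) % 3 == 1)          # e.g., paper (1) beats rock (0)
    p2_wins = np.sum((p2 - p1) % 3 == 1)
    return p1_wins - p2_wins

differentials = np.array([simulate_dyad() for _ in range(N_DYADS)])
print("SD of win-count differential under random play:", differentials.std().round(2))
```

Observed differentials lying far outside this null distribution are what signal that one player is exploiting dependencies in the other's move choices.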
Dr. Chris R. Sims
In recent years, computational reinforcement learning (RL) has become an influential normative framework for understanding human learning and decision-making. However, unlike the RL algorithms developed in machine learning, human learners face strict limitations in terms of information processing capacity. For example, human learning performance decreases as the number of possible states of the environment increases, even when controlling for the amount of experience with each environmental state. Collins and Frank [2012; European Journal of Neuroscience, 35(7), 1024-1035] demonstrated this experimentally in a simple instrumental learning task. Different conditions of their experiment manipulated the “set size” of visual stimuli to which subjects had to respond, and they showed that learning efficiency decreased monotonically with set size in a manner incompatible with standard RL algorithms. They interpreted the sub-optimality of human learning performance in terms of decay in human working memory. Our work proposes an alternative explanation for this phenomenon, based on the idea of bounded rationality. We propose that human learners navigate a trade-off between maximizing task performance and minimizing the complexity of the learned action policy, where policy complexity is formalized in information-theoretic terms. We apply an RL model embodying this trade-off to the Collins and Frank dataset and achieve a fit comparable to their models. The modeling results are consistent with our hypothesis: human learners trade part of the expected utility for a simpler action policy, owing to their own information-processing limitations.
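A minimal sketch of the kind of reward-complexity trade-off described above, assuming policy complexity is measured as the mutual information I(S; A) between states and actions; the fixed-point iteration and all parameter values are our own illustration, not the fitted model.

```python
# Illustrative reward-complexity trade-off: find pi(a|s) maximizing
#   E[reward] - (1/beta) * I(S; A).
# The optimum has the Blahut-Arimoto-like form pi(a|s) ∝ p(a) * exp(beta * Q(s, a)).
import numpy as np

def reward_complexity_policy(Q, beta, n_iter=200):
    """Q: (n_states, n_actions) expected rewards; beta: trade-off parameter."""
    n_states, n_actions = Q.shape
    p_a = np.full(n_actions, 1.0 / n_actions)      # marginal action distribution
    for _ in range(n_iter):
        logits = beta * Q + np.log(p_a)
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)        # pi(a|s)
        p_a = pi.mean(axis=0)                      # assumes a uniform state distribution
    return pi

# With one correct action per stimulus, a larger "set size" (more states)
# requires a more complex policy; a capacity-limited learner therefore
# settles for a blunter policy and lower expected reward.
Q = np.eye(6)                                      # set size 6
print(reward_complexity_policy(Q, beta=2.0).round(2))
```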
Anne Collins
Most reinforcement learning (RL) experiments use familiar reinforcers, such as food or money, whose reward value is relatively objective. However, in everyday life, teaching signals are rarely so straightforward --- often we must learn from the achievement of subgoals (e.g., high heat must be achieved before cooking), or from feedback that we have been instructed to perceive as reinforcement, yet is not intrinsically rewarding (e.g., grades). As such, investigating how similar the well-studied dynamics of learning from familiar rewards are to the dynamics of learning from more realistic subgoals and instructed rewards can help us understand the ecological validity of laboratory reinforcement learning research.

In this talk, we discuss our recent work investigating these potential similarities using computational modeling, with an emphasis on individual differences. In our experiment, participants completed a probabilistic RL task, comprising multiple interleaved two-armed bandit problems, and an N-back task. Some bandits were learned using points, a familiar reward, while others were learned based on whether their selection led to a “goal image” unique to each trial, an instructed reward. In the instructed condition, participants tended to learn more slowly, and each participant’s performance correlated with their working memory ability. Hierarchical Bayesian model comparison revealed that differences in behavior due to feedback type were best explained by a lower learning rate for instructed rewards, although this effect was reversed or absent for some participants. These strong individual differences suggest that differences in learning dynamics between familiar and instructed rewards may not be universal.
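The sketch below illustrates, under our own simplifying assumptions rather than the authors' hierarchical Bayesian model, the core claim that a lower learning rate for instructed rewards produces slower learning: a softmax delta-rule learner on a two-armed bandit with a condition-specific learning rate.

```python
# Hypothetical delta-rule bandit learner; alpha differs between the familiar
# (points) and instructed (goal-image) feedback conditions.
import numpy as np

def simulate_bandit(alpha, reward_probs=(0.2, 0.8), n_trials=100,
                    inv_temp=5.0, seed=0):
    """Return the learner's choices (0/1) across trials."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    choices = np.empty(n_trials, dtype=int)
    for t in range(n_trials):
        p_right = 1.0 / (1.0 + np.exp(-inv_temp * (q[1] - q[0])))
        c = int(rng.random() < p_right)            # softmax choice
        r = float(rng.random() < reward_probs[c])  # probabilistic feedback
        q[c] += alpha * (r - q[c])                 # delta-rule update
        choices[t] = c
    return choices

familiar = simulate_bandit(alpha=0.30)     # hypothetical learning rates
instructed = simulate_bandit(alpha=0.10)
print(familiar.mean(), instructed.mean())  # proportion of better-option choices
```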
Yunseo Jeong
Harhim Park
Prof. Woo-Young Ahn
The brain has multiple systems for decision-making, such as the instrumental and Pavlovian systems. Pre-programmed Pavlovian responses, such as approach towards appetitive outcomes or freezing/withdrawal from aversive outcomes, afford animals useful shortcuts for behavior, but overcoming such Pavlovian bias is often necessary for achieving long-term instrumental goals. The orthogonalized Go/Nogo task (Guitart-Masip et al., 2012) is widely used to examine Pavlovian bias, but it has certain limitations. First, in the task Pavlovian bias is observed only when learning appetitive outcomes, not when learning aversive outcomes. Second, while aversive outcomes may cause either freezing or withdrawal, the task cannot differentiate the contributions of the two response types. Third, it yields only final behavioral responses and provides no information about the time-course of cooperation or competition between the two systems. To address these limitations, we developed a new version of the orthogonalized Go/Nogo task with mouse-tracking, called the orthogonalized Approach/Withdrawal task, which requires an active response on every trial. Seventy-seven healthy participants performed the task, and they showed Pavlovian bias for stimuli predicting both appetitive and aversive outcomes. Computational modeling with hierarchical Bayesian parameter estimation also revealed strong Pavlovian and approach biases. These results were replicated in an independent experiment in which participants used keyboard buttons instead of a mouse. Mouse-tracking results suggest that Pavlovian responses to aversive cues reflect withdrawal rather than freezing, and that response pathways are shorter for approach than for withdrawal.
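A hedged sketch of how a Pavlovian bias term typically enters Guitart-Masip-style models, adapted here in spirit to the Approach/Withdrawal framing; the functional form and parameter values are illustrative and are not the authors' fitted model.

```python
# Pavlovian bias as an additive boost to the approach action weight:
# appetitive stimulus values push toward approach, aversive values toward withdrawal.
import numpy as np

def p_approach(q_approach, q_withdraw, v_pavlovian, approach_bias=0.3, pav_weight=0.5):
    """Softmax probability of the approach response for one stimulus.

    q_approach, q_withdraw : instrumental action values
    v_pavlovian            : stimulus value (positive = appetitive, negative = aversive)
    approach_bias          : baseline tendency toward the active/approach response
    pav_weight             : strength of the Pavlovian bias
    """
    w_approach = q_approach + approach_bias + pav_weight * v_pavlovian
    return 1.0 / (1.0 + np.exp(-(w_approach - q_withdraw)))

# Equal instrumental values: the appetitive cue still promotes approach
# and the aversive cue promotes withdrawal.
print(p_approach(0.0, 0.0, v_pavlovian=+1.0))
print(p_approach(0.0, 0.0, v_pavlovian=-1.0))
```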
Dr. Leslie Blaha
Prof. Cleotilde (Coty) Gonzalez
Model variability is important: if systematic variation in model predictions does not reflect systematic variation in human behavior, the model's ability to describe, predict, and explain behavior is in question. We demonstrate a method for comparing variation in model predictions to variation in human behavior in a dynamic decision making task. Dynamic decisions are a sequence of inter-dependent choices in changing environments, where human choices may systematically change over time. We can characterize these changes with a qualitative and quantitative visual analytics approach, recurrence quantification analysis (RQA). RQA visualizes (with recurrence plots) and describes (with recurrence statistics) recurring states in sequences of observations. We compared human choice sequences in a dynamic decision making task to the predictions of an instance-based learning (IBL) model, a memory-based model of choice with two parameters (noise and decay). Specifically, we generated predictions using two parameterizations of the IBL model: one using the default noise and decay parameters from the ACT-R cognitive architecture, the other using the average of the noise and decay parameters from IBL models fit to human data at the individual level. We compared the recurrence statistic distributions of the human data and both parameterizations. We find that the ACT-R default parameters predict more decision makers with less trial-to-trial change in choices than is observed in the human data. In contrast, the averaged parameters predict more decision makers with more trial-to-trial change in choices than is observed in the human data. RQA provides new tools for assessing model predictions, and a new source of evidence for demonstrating that models successfully characterize sequences of human choice.
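As a pointer to what the recurrence statistics measure, the sketch below computes a recurrence plot and recurrence rate for a categorical choice sequence; the toy sequence and the simple categorical definition are our own illustration, not the specific RQA pipeline used in the study.

```python
# Recurrence plot for a categorical choice sequence: R[i, j] = 1 when the
# choice on trial i equals the choice on trial j. The recurrence rate is the
# proportion of recurrent points off the main diagonal; more trial-to-trial
# change in choices yields a lower recurrence rate.
import numpy as np

def recurrence_plot(choices):
    c = np.asarray(choices)
    return (c[:, None] == c[None, :]).astype(int)

def recurrence_rate(R):
    n = R.shape[0]
    return (R.sum() - np.trace(R)) / (n * (n - 1))

choices = [0, 1, 1, 2, 1, 0, 0, 2]     # hypothetical options chosen on 8 trials
print(recurrence_rate(recurrence_plot(choices)))
```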