Sensitivity to value-driven attention is predicted by how we learn from value

Reward learning is known to influence the automatic capture of attention. This study examined how the rate of learning, after high- or low-value reward outcomes, can influence future transfers into value-driven attentional capture. Participants performed an instrumental learning task that was directly followed by an attentional capture task. A hierarchical Bayesian reinforcement model was used to infer individual differences in learning from high or low reward. Results showed a strong relationship between high-reward learning rates (or the weight that is put on learning after a high reward) and the magnitude of attentional capture with high-reward colors. Individual differences in learning from high or low rewards were further related to performance differences when high- or low-value distractors were present. These findings provide novel insight into the development of value-driven attentional capture by showing how information updating after desired or undesired outcomes can influence future deployments of automatic attention.

Reward associations are learned through past experiences where an event (e.g., choosing a stimulus) is linked to a probabilistic outcome. Influential learning theories suggest that when an organism receives new information (e.g., choice outcome), current beliefs are updated in proportion to the difference between expected and actual outcomes (termed prediction error, δ). Notably, the degree to which prediction errors change stimulus-reward associations is determined by an additional factor termed the learning rate, α (Daw, 2011; Sutton & Barto, 1998; Watkins & Dayan, 1992). Learning rates describe the rate at which new information replaces old and are fundamental to adaptive behavior. Higher learning rates result in greater trial-to-trial belief adjustments after a single instance of feedback and are linked to dopamine levels within the striatum (Frank, Moustafa, Haughey, Curran, & Hutchison, 2007) or activity changes within the anterior cingulate cortex (Behrens, Woolrich, Walton, & Rushworth, 2007), a region known to evaluate prediction errors and choice difficulty (cf. Brown & Braver, 2005; Shenhav, Straccia, Cohen, & Botvinick, 2014).
We examined whether learning rates have a direct impact on the development of value-driven attentional capture. Studies focusing on the interaction between value and capture show differential effects for high- and low-value rewards. In particular, when learning is based on low value, subsequent tests assessing capture show smaller or no effects (Anderson & Yantis, 2012; Della Libera & Chelazzi, 2006). Reinforcement studies provide a possible explanation for this effect: cognitive models that are used to predict trial-to-trial learning behavior generally show higher learning rates for positive outcomes relative to negative ones (Frank et al., 2007; Kahnt et al., 2009). Hence, stimulus beliefs are updated more rapidly after positive outcomes, which might underlie the stronger development of attentional capture for high-reward value.
We hypothesized that the sensitivity of value-driven attention is influenced by the weight that is put on learning from high-reward feedback in particular. Instrumental learning was directly followed by an attentional capture task in which participants searched for a shape singleton while a colored distractor was present on half the trials. The color of the distractors was the color most often receiving either a low or high reward in the learning task (see Fig. 1). Value-based attentional capture was expected to be strongest for colors previously associated with a high value. Separate learning rates for high and low value were obtained by using a computational reinforcement-learning model (see Fig. 2) that reliably predicted individual trial-to-trial choices (see Fig. 3). High-value learning rates (α High) were expected to predict slowing with high-value distractors, whereas an explorative analysis focused on how learning from high- or low-value outcomes relates to the differential experience of high- and low-value distractors.

Method

Participants
Twenty-one participants (six males, mean age = 23 years, range 18-31 years) with normal or corrected-to-normal vision participated for monetary compensation (M = 11.5, SD = 0.3 euros). Sample size was based on previous studies focusing on value-driven attentional capture (range = 16-26; Anderson et al., 2011). One participant was excluded from all analyses because of chance-level performance. Informed consent was obtained from all participants, and the local ethics committee of the VU University Amsterdam approved all procedures.

Value-based probabilistic learning
Three color pairs (AB, CD, EF) were presented in random order, and participants learned to choose one of the two color stimuli (see Fig. 1a). Colors were selected from a subset of six near-equiluminant colors (red, green, blue, yellow, purple, and turquoise), with an approximate luminance of 27.2 cd/m² (SD = 5.2 cd/m²), and presented on a black background. For each participant, the pairs blue-yellow, red-green, and purple-turquoise were randomly assigned to three categories (AB, CD, EF) and counterbalanced in mapping (e.g., blue-yellow or yellow-blue for AB). Probabilistic feedback followed each choice to indicate a high ("correct" +0.10 points) or low ("incorrect" +0.01 points) value. Choosing the high-value Color A led to high rewards on 80% of the trials, whereas selecting the low-value Color B led to low rewards on 80% of the trials. Other ratios for high reward were 70:30 (CD) and 60:40 (EF). Participants were told that the total sum of points earned would be converted into a monetary reward at the end of the experiment. Trials started with a white fixation cross followed by two colored squares (1.67° × 1.67° visual angle) left and right of the fixation cross (2.1° distance to fixation). Choices were highlighted by a white frame (3.33° × 3.33° visual angle) and followed by feedback. Omissions or responses slower than 1,250 ms were followed by the text "too slow" for 300 ms. A 30-trial practice session was conducted to familiarize participants with the task (feedback: "correct" or "incorrect") and was followed by five blocks of 60 trials each (300 trials total; equal numbers of AB, CD, and EF).

[Fig. 3, partial caption: In the right plane, the learning curve for choosing A over B, or P(A|AB), is simulated for each participant with the derived parameters and evaluated against the observed data for either fits to all choice options (c), or only AB trials (d). Error bars represent SEM; β/100 for visualization.]

Attentional capture
Participants searched for a unique circle shape (target) among five square shapes (distractors). Responses were based on the orientation of a vertical or horizontal line contained within the circle (see Fig. 1b). On half the trials, both target and distractors were presented in white (black background). For the other half, one of the distractor squares was rendered in the highly rewarded A-color or the low rewarded B-color. The target (circle) shape was always presented in white. Trials started with a white fixation cross followed by the search display. This display showed the fixation cross, surrounded by six shapes (1.67°× 1.67°visual angle) equally spaced along an imaginary circle (5.2°radius). Feedback indicated correct or incorrect responses. Participants started with a practice block of 20 trials, followed by 120 experiment trials.

Reinforcement learning model: Q-learning
The influence of learning rates on attentional capture was investigated using the computational Q-learning algorithm (Daw, 2011; Frank et al., 2007; Watkins & Dayan, 1992). Because previous work has found stronger distractor effects for stimuli associated with high rewards, we defined separate learning rate parameters for high (α High) and low (α Low) value feedback (cf. Frank et al., 2007; Kahnt et al., 2009). Q-learning assumes participants maintain a reward expectation for each stimulus (A-to-F). The expected value (Q) for selecting a stimulus i (could be A-to-F) on the next trial is then updated as follows:

$$Q_i(t+1) = Q_i(t) + \alpha_{High/Low}\,[r_i(t) - Q_i(t)]$$

where 0 ≤ α High/Low ≤ 1 represent learning rates (α High applies after high-reward feedback and α Low after low-reward feedback), t is trial number, and r = 1 (high) or r = 0 (low) reward. The probability of selecting one response over the other (i.e., A over B) is computed as:

$$P_A(t) = \frac{\exp(\beta\,Q_A(t))}{\exp(\beta\,Q_A(t)) + \exp(\beta\,Q_B(t))}$$

with 0 ≤ β ≤ 100 being known as the inverse temperature.
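The update and choice rules described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the starting expectation of 0.5 used in the example is an assumption, since the text does not state initial Q values.

```python
import math


def update_q(q, reward, alpha_high, alpha_low):
    """Return the updated expectation for the chosen stimulus.

    reward is 1 (high) or 0 (low); the learning rate is picked
    according to the feedback that was received.
    """
    alpha = alpha_high if reward == 1 else alpha_low
    prediction_error = reward - q
    return q + alpha * prediction_error


def p_choose_a(q_a, q_b, beta):
    """Softmax probability of choosing A over B.

    Written in logistic form, which is algebraically equal to
    exp(beta*q_a) / (exp(beta*q_a) + exp(beta*q_b)).
    """
    return 1.0 / (1.0 + math.exp(-beta * (q_a - q_b)))


# Hypothetical example: both expectations start at 0.5 (an assumption).
q_a = update_q(0.5, 1, alpha_high=0.4, alpha_low=0.1)  # high reward after choosing A
q_b = update_q(0.5, 0, alpha_high=0.4, alpha_low=0.1)  # low reward after choosing B
print(q_a, q_b, p_choose_a(q_a, q_b, beta=5.0))
```

With a larger α High, a single high reward moves the expectation for A further from its starting point, which in turn raises the probability of choosing A on the next AB trial.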

Bayesian hierarchical estimation procedure
The Q-learning algorithm was fit using a Bayesian hierarchical estimation method where parameters for individual subjects are drawn from a group-level distribution. This hierarchical structure is preferred for parameter estimation because it allows for the simultaneous estimation of both group-level parameters and individual parameters (Lee, 2011; Steingroever, Wetzels, & Wagenmakers, 2013; Wetzels, Vandekerckhove, Tuerlinckx, & Wagenmakers, 2010). Figure 2 shows a graphical representation of the model. The quantities $r_{i,t-1}$ (reward of participant i on trial t - 1) and $ch_{i,t}$ (choice of participant i on trial t) can be obtained directly from the data. The quantities $\alpha_{H_i}$ (α High of participant i), $\alpha_{L_i}$ (α Low of participant i), and $\beta_i$ are deterministic because we model their respective probit transformations $z'_i$ ($\alpha'_{H_i}$, $\alpha'_{L_i}$, $\beta'_i$). The probit transform is the inverse cumulative distribution function of the standard normal distribution. The parameters $z'_i$ lie on the probit scale, covering the entire real line. Parameters $z'_i$ were drawn from group-level normal distributions with mean $\mu_{z'}$ and standard deviation $\delta_{z'}$. A normal prior was assigned to the group-level means, $\mu_{z'} \sim N(0, 1)$, and a uniform prior to the group-level standard deviations, $\delta_{z'} \sim U(1, 1.5)$ (Steingroever et al., 2013; Wetzels et al., 2010).
Two parallel versions of the Q-learning model were implemented to optimize fits to all trials (i.e., A-to-F), or only AB trials (used in the attention task). Both models were implemented in Stan (Hoffman & Gelman, 2014; Stan Development Team, 2014). Multiple chains were generated to ensure convergence, which was evaluated with the Rhat statistic (Gelman & Rubin, 1992). Evaluations indicated convergence for both fit procedures (i.e., all Rhats were close to 1). Figure 3 shows group-level posteriors (a, b), and data recovery evaluations (c, d).
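The mapping between the probit scale and the bounded parameters can be sketched in Python as follows. This is an illustration only, not the Stan model used in the study: the function names are hypothetical, and rescaling β by a factor of 100 to reach its 0 to 100 range is an assumption that the text does not spell out.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def from_probit(z):
    """Map a probit-scale value (any real number) to the (0, 1) interval."""
    return _STD_NORMAL.cdf(z)


def to_probit(p):
    """Inverse of from_probit; p must lie strictly between 0 and 1."""
    return _STD_NORMAL.inv_cdf(p)


def transform_params(alpha_high_z, alpha_low_z, beta_z, beta_max=100.0):
    """Convert probit-scale values into model parameters.

    beta_max=100.0 is an assumed rescaling so that beta covers 0..100.
    """
    return (
        from_probit(alpha_high_z),
        from_probit(alpha_low_z),
        beta_max * from_probit(beta_z),
    )


# A probit-scale value of 0 sits at the midpoint of each bounded range.
print(transform_params(0.0, 0.0, 0.0))
```

Sampling on the unbounded probit scale lets the group-level distributions be normal while the individual learning rates still respect their 0 to 1 bounds.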
The definition of two learning parameters was justified with the evaluation of a hierarchical Q-model with only one learning parameter, which was updated after each trial. Model selection was based on individual and group-level Bayesian Information Criterion (BIC), using a random-effects model on the log likelihoods (Jahfari, Waldorp, Ridderinkhof, & Scholte, 2015), and supported the use of two learning rates with lower BIC values (BIC_group: Q_2alpha = 2964, Q_1alpha = 3155; BIC_individual mean: Q_2alpha = 154, Q_1alpha = 162).
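For readers unfamiliar with the criterion, the standard BIC penalizes the log likelihood by the number of free parameters, BIC = k ln(n) - 2 ln(L), with lower values indicating the preferred model. The Python sketch below uses made-up log likelihoods purely for illustration; it does not reproduce the random-effects procedure on the log likelihoods cited above, and the function name is hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_observations):
    """Standard Bayesian Information Criterion: k*ln(n) - 2*ln(L)."""
    return n_params * math.log(n_observations) - 2.0 * log_likelihood


# Hypothetical values for one participant with 300 learning trials.
# Two-rate model: alpha_high, alpha_low, beta (k = 3).
# One-rate model: alpha, beta (k = 2).
bic_two_rates = bic(log_likelihood=-140.0, n_params=3, n_observations=300)
bic_one_rate = bic(log_likelihood=-150.0, n_params=2, n_observations=300)

# The extra parameter is justified only if the fit improves enough
# to offset the ln(n) penalty.
print(bic_two_rates < bic_one_rate)
```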

Analysis
The choice for three probability pairs during training allowed us to compute and differentiate both specific (only considering the reliable 80-20 feedback) and general (across all contingencies) learning rates for high and low rewards. The choice for AB colors in the capture task was both pragmatic (considering the total number of distractor and nondistractor trials) and based on the literature, as value-based attention is commonly studied with a high reward contingency of 80%. Consistently, the attentional capture task only used the most distinct A and B colors from the learning task (contingency 80-20). The relationship between value-based capture and learning parameters (α High and α Low) was evaluated with model fits to (1) only AB trials or (2) all A-to-F trials. Because both learning parameters were restricted between 0 and 1, Spearman's rank correlation (rho) or partial correlation (rho pcor) was used to evaluate capture-learning relationships.
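A rank correlation and its partial variant can be sketched in plain Python as below. The sketch assumes the common first-order partial correlation formula applied to Spearman coefficients; the text does not state how rho pcor was computed, so this is an assumption, and all function names and data are hypothetical.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Rank correlation between x and y while controlling for z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data: x = alpha_high, y = high-value slowing, z = alpha_low.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
z = [5, 3, 1, 4, 2]
print(spearman(x, y), partial_spearman(x, y, z))
```

Working on ranks avoids assuming a linear relationship or normally distributed values, which suits parameters bounded between 0 and 1.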

Value-based learning and attentional capture
In the learning task, subjects reliably learned to choose the most rewarded option from all three pairs. For each pair, the probability of choosing the better option was above chance (ps < .001), and the effect of learning decreased from AB (M = 0.81, SD = 0.12) to CD (M = 0.73, SD = 0.17) to EF (M = 0.71, SD = 0.15), F(1, 19) = 5.12, p = .036, partial η² = 0.21.

Value-based attentional capture and learning rates
Next, we examined whether individual differences in the rate of information updating after desired (α High) or undesired (α Low) outcomes were predictive of the magnitude of automatic capture. Evaluations of the Q-learning model showed that both fit procedures reliably predicted individual trial-to-trial choices during the learning task (Fig. 3c, d). Learning parameters derived from the models were then used to examine the relationship with attentional capture (Table 1).
When the model was optimized to predict AB choices, results showed a strong relationship between α High and high-value slowing (rho pcor = 0.69, p = .00007; Fig. 4b), while controlling for the nonsignificant relationship between α Low and high-value distractors (rho pcor = 0.09, p = .70). No significant relationship was found between α High and capture with low-value colors (p = .35). Hence, participants who updated their beliefs robustly after high rewards (higher α High) experienced more slowing when the distractor had the high-reward A color. This relationship was specific to high-value learning rates and was not predicted by the sampling/selection frequency of the A color (% correct AB pairs) during learning (rho = 0.24, p = .31), or the estimated belief (Q value A color) at the end of learning (rho = 0.24, p = .30). No relationship was found between learning rates and slowing when the model was optimized to predict all learning-task choices (A-to-F), with reward probabilities 80:20, 70:30, and 60:40 (all ps > .05).
Most nonreward studies find a significant slowing effect for colored singletons. However, attentional capture for low-reward colored singletons is not always found. We explored whether capture differences in RT between high- and low-reward distractors relate to differences in learning from high- or low-reward outcomes. Results showed that larger differences between α High and α Low (α H-L = α High - α Low) predicted larger RT differences between high- and low-value distractors for AB-model fits (rho = 0.47, p = .04; Fig. 4c) and all-trial model fits (rho = 0.59, p = .008). This relationship remained reliable after the removal of the lowest point for fits to all trials (rho = 0.52, p = .02), but was only marginal for fits to AB trials (rho = 0.40, p = .09).

Discussion
This study relates the underlying mechanisms of reward learning to the development of value-based attentional capture. We showed how learning from high- or low-value outcomes develops into value-driven attentional biases. This finding sheds light on a surge of recent results focusing on the consequences of reward on attention. For example, value-driven capture is generally stronger when learning is based on high values (Anderson & Yantis, 2012; Chelazzi et al., 2013; Della Libera & Chelazzi, 2006). This has been attributed to the implicit assumption that high-value distractors capture attention more robustly than low-value distractors. We refine this assumption by demonstrating how individual differences in learning relate to the magnitude of value-driven attentional capture.
Our results show how value learning in a task that is completely unrelated to visual search may develop into robust value-driven capture. Such attentional biases have been shown after classical conditioning and after instrumental tasks with a direct resemblance to the capture task (Anderson et al., 2011; Della Libera & Chelazzi, 2009; Hickey, Chelazzi, & Theeuwes, 2010), or with a focus on next-trial decision modulations with previously rewarded distractors (Itthipuripat, Cha, Rangsipat, & Serences, 2015). We extend current beliefs by showing how instrumental learning can transfer into the automatic capture of attention for a single feature, irrespective of context (see Anderson, 2014, for differences with classical conditioning).
Attentional selection plays an important role during learning, and is especially useful if some information is more relevant (e.g., Dayan, Kakade, & Montague, 2000). Here, high- and low-value colors were always presented simultaneously during learning. Importantly, the subsequent capture task only showed reliable slowing effects for high-value colors. Neurophysiological work has suggested that selective attention suppresses the processing of undesired stimuli, which in effect may imply that only the high-reward stimulus is processed (Moran & Desimone, 1985). Optimal responses during learning could involve attentional priority toward the desired high-value color (leading to value-based capture), and suppression of the undesired low-value color (reduced distraction in future tasks).
Higher learning rates represent stronger trial-to-trial belief updates about the chosen stimuli and could motivate the advanced prioritization of the desired stimulus. This predicts that participants with a steep learning rate (for high-value outcomes) prioritize earlier and longer, and so experience more capture in future tasks (Kahnt, Park, Haynes, & Tobler, 2014; Störmer, Eppinger, & Li, 2014). Compatibly, we found that belief updates after high-value outcomes predicted the degree of capture with high-value distractors. A final explorative analysis indicated how learning rate differences between high- and low-value outcomes relate to capture differences, given a low- or high-value distractor. Participants who learned faster from positive outcomes experienced more capture from high- than from low-value distractors. These findings indicate that learning rates modulate selective attention during learning and, by doing so, shape the experience of capture in future contexts.
Notably, the transfer of value into capture was sensitive to both high value and feedback consistency. However, the differential capture of attention with high- or low-reward distractors was more sensitive to how we learn differentially from reward magnitude in general. These probability-specific (i.e., transfer) and general (i.e., differential experience) relationships are novel and should be studied further to understand the significance of either magnitude or consistency in the development of automatic attention. For example, future designs could use only high-value distractors, while feedback consistency is varied during learning (O'Doherty, 2014).
This study provides novel prospects to incorporate both computational and neuroscience theories in our understanding of value-driven capture. For example, the magnitude of learning from positive feedback is attributed to striatal dopamine levels, whereas trial-to-trial adjustments after a single instance of negative feedback relate to elevated dopamine within the prefrontal cortex (PFC; Frank et al., 2007). PFC learning effects are part of a controlled learning system with a strong dependence on working memory capacity (Collins & Frank, 2012), and increased dopamine levels within PFC are reported to overstabilize working memory representations such that they persist over time (Durstewitz, Seamans, & Sejnowski, 2000). The effects found with high-value rewards could rely on higher dopamine levels within the striatum, a region not restricted by memory decay or capacity and central to the formation of "habit memory" (Knowlton, Mangels, & Squire, 1996; Pasupathy & Miller, 2005). Consistently, elevated levels of dopamine in PFC could selectively modulate the less intuitive learning rates after low-value outcomes through the stabilization of working memory representations, and so influence their transfer into future capture. High-value capture has recently been linked to working memory performance (Anderson et al., 2011), prediction (Sali, Anderson, & Yantis, 2014), and dopamine (Anderson et al., 2016; Hickey & Peelen, 2015). Future couplings between

[Fig. 4, partial caption: Relationship between α High and the magnitude of slowing caused by the high-value distractor. (c) Individual differences in learning from high- or low-value outcomes (α High-Low = α High - α Low) predicted RT differences between high- and low-value distractors.]