Evidence that the ventral stream codes the errors used in hierarchical inference and learning

Hierarchical feedforward processing makes object identity explicit at the highest stages of the ventral visual stream. We leveraged this computational goal to study the fine-scale temporal dynamics of neural populations in posterior and anterior inferior temporal cortex (pIT, aIT) during face detection. As expected, we found that a neural spiking preference for natural over distorted face images was rapidly produced, first in pIT and then in aIT. Strikingly, in the next 30 milliseconds of processing, this pattern of selectivity in pIT completely reversed, while selectivity in aIT remained unchanged. Although these dynamics were difficult to explain from a pure feedforward perspective, a model class computing errors through feedback closely matched the observed neural dynamics and parsimoniously explained a range of seemingly disparate IT neural response phenomena. This new perspective augments the standard model of online vision by suggesting that neural signals of states (e.g. likelihood of a face being present) are intermixed with the error signals found in deep hierarchical networks.


The primate ventral visual stream is a hierarchically organized set of cortical areas beginning with the primary visual cortex (V1) and culminating with explicit (i.e. linearly decodable) representations of objects in inferior temporal cortex (IT) (1) that quantitatively account for invariant object discrimination behavior (2). Consistent with a feedforward flow of processing from V1 to V2 to V4 to IT, neurons at higher cortical stages are more selective for object shape and identity while being more tolerant to changes in object size and position (3)(4).

Formalizing object recognition as the result of a series of feedforward computations yields models that achieve impressive performance on basic object categorization tasks (5)(6), similar to the level of performance achieved by IT neural populations (7)(8). Importantly, these models

Supplementary Fig. 1b). These 126 images create an "aperture problem" because they are difficult to distinguish based on local information alone and must be disambiguated based on the surrounding context (27). This image set thus poses a more stringent challenge of face detection ability than standard screen sets which vary along many stimulus dimensions (i.e. faces vs bodies and non-face objects; see

the 'cyclops', which contain information that is globally inconsistent with a face (28), we identified 13 atypical face part configurations that drove neurons to produce an early response that was >90% of their response to a correctly configured whole face (Fig. 2a). Because these images drove a high feedforward response, we view them as being relatively well matched in their low-

parts) rather than a reversal of preference (Fig. 4a, right). As a result, the majority of anterior sites preferred images with a typical arrangement of the face parts in the late phase of the response (prefer typical: 60-90 ms = 78% of sites vs. 100-130 ms = 78% of sites; p > 0.05, n = 40 sites) despite only a minority of upstream sites in pIT preferring these images in their late response (Fig. 4b, black bars). This suggests that spiking responses of individual aIT sites resolve images as expected from a computational system whose purpose is to detect faces, as previously suggested (29). Finally, in cIT, whose anatomical location is intermediate to pIT and aIT, we observed a selectivity profile over time that was intermediate to that of pIT and aIT, consistent with its position in the ventral visual hierarchy (Fig. 4a,b).

We also tested the hypothesis that absolute initial firing rates, which were not perfectly matched, were somehow responsible for producing the pIT image preference reversal. We found no support for this hypothesis: the observed change in firing rate over time (Δ_pIT = r_pIT,late - r_pIT,early) was weakly correlated with the strength of the initial response (ρ(r_pIT,early, Δ_pIT) = -0.24 ± 0.15, p = 0.044, n = 20 images; for these firing rate controls, the original whole face image drove a much higher response than the synthetic images we created, and, being a firing rate outlier, we excluded this image). Instead, firing rate changes over time were strongly correlated with the class (typical versus atypical) of the image (ρ(class, Δ_pIT) = -0.77 ± 0.04, p < 0.01, n = 20 images). In other words, responses to images with normally arranged face parts were specifically weaker by 18% on average in the next phase of the response (Δrate (60-90 vs 100-130 ms) = -18% ± 4%, p < 0.01; n = 7 images), but responses to images with unnatural arrangements of face parts, which also drove high initial responses, did not experience any firing rate reduction in the next phase of the response (Δrate (60-90 vs 100-130 ms) = 2% ± 1%, p > 0.05; n = 13 images). This dependence on the image class and not on initial response strength argues against explanations, such as rate-driven adaptation, that solely depend on a unit's activity to explain decreasing neural responses over time. Indeed, we found that the late phase firing rates in pIT could not be predicted from early phase pIT firing rates (ρ(r_pIT,early, r_pIT,late) = 0.07 ± 0.17, p > 0.05; n = 20 images).
In contrast, we found that pIT late phase firing rates were better predicted by early phase firing rates in downstream regions cIT and aIT (ρ(r_cIT,early, r_pIT,late) = -0.52 ± 0.11, p < 0.01; ρ(r_aIT,early, r_pIT,late) = -0.36 ± 0.14, p = 0.012; n_pIT = 115, n_cIT = 70, n_aIT = 40 sites). That is, for images that produced high early phase responses in cIT and aIT, the following later phase responses of units in the lower level area (pIT) tended to be low, consistent with the hypothesis that feedback from those areas is producing the pIT selectivity reversals. Finally, the relative speed of selectivity reversals in pIT (~30 ms peak-to-peak) makes explanations based on fixational eye movements or shifts in attention (e.g. from behavioral surprise to unnatural arrangements of face parts) unlikely, as saccades and attention shifts occur on slower timescales (hundreds of milliseconds) (30).
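The cross-area analysis above can be sketched in code. The data below are fabricated for illustration only, and the rank-correlation helper is a minimal stand-in for the Spearman statistics reported:

```python
import numpy as np

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of the ranks
    # (double argsort yields ranks; valid here since values are distinct)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
# hypothetical per-image responses: early-phase downstream (aIT) rates
# and late-phase upstream (pIT) rates constructed to be anticorrelated
ait_early = rng.random(20)
pit_late = -ait_early + 0.3 * rng.random(20)

rho = spearman(ait_early, pit_late)
# a negative rho is the signature consistent with feedback: images that
# drive strong early downstream responses are followed by weak late
# upstream responses
```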

Computational models of neural dynamics in IT

Given the above observations of a non-trivial, dynamic selectivity reversal during face detection, we next proceeded to build formal models of gradually increasing complexity to determine the minimal set of assumptions that could capture our empirical findings. We used a linear dynamical systems modeling framework to evaluate dynamics in different hierarchical architectures (Supplementary Fig. 1a,b; see SI Methods). A core principle of feedforward ventral stream models is that object selectivity is built by feature integration from one cortical area to the next in the hierarchy, leading to low dimensional representations at the top of the hierarchy. Here, we take the simplest feature integration architecture, where a unit in a downstream area linearly sums the input from units in an upstream area to produce greater downstream selectivity than any upstream input alone. This generic encoding model conceptualizes the idea that different types of evidence, local (i.e. parts) and global (i.e. arrangement of parts), have to converge and be integrated to separate face from non-face images in our image set. Dimensionality reduction as performed in this network is a key computation specified by most network architectures, whether unsupervised (e.g. autoencoder) or supervised (e.g. backprop), allowing these networks to learn an abstracted, low dimensional representation from the high-dimensional input layer. Here, we performed dimensionality reduction in linear networks, as monotonic nonlinearities can be readily accommodated in our framework (14)(21). First, we focused on two-stage models to use the simplest configuration possible and gain intuition, since stacks of two processing stages can be used to generate a hierarchical system of any depth.
In this framing, the activity of the unit in the output stage corresponds to aIT, which integrates activity from units in the deepest hidden stage measured, corresponding to pIT (Figure 4, top row, first five models, and Supplementary Fig. 1a,b), bearing in mind that aIT is actually a hidden stage of processing with respect to the next processing stage in the larger cortical stack. When an external step input is applied to such a system, it will of course produce a (lagging) step response in each of the two stages. We here sought to determine how adding recurrent connectivity to this basic feedforward architecture could generate internal dynamics beyond those simple dynamics, and to compare those dynamics with the observed IT neural dynamics.
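As a minimal sketch of this feedforward baseline (time constants and weights here are illustrative assumptions, not fitted values), a two-stage cascade of leaky integrators produces exactly such lagging step responses:

```python
import numpy as np

# two leaky-integrator stages in series; a step input drives the first
# stage, whose output drives the second
tau, dt = 10.0, 1.0   # integration time constant and step size (ms), assumed
steps = 300
u = 1.0               # external step input, switched on at t = 0

x1 = np.zeros(steps)  # first stage (cf. pIT)
x2 = np.zeros(steps)  # second stage (cf. aIT)
for t in range(1, steps):
    x1[t] = x1[t-1] + (dt / tau) * (-x1[t-1] + u)
    x2[t] = x2[t-1] + (dt / tau) * (-x2[t-1] + x1[t-1])

# both stages rise monotonically to a steady state; the second stage
# lags the first, with no internal dynamics beyond the simple step
```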

Based on previous ideas in the literature, we considered lateral inhibition within a stage, normalization within a stage (23), and cortico-cortical feedback (14). Adding recurrent lateral inhibitory connections leads to competition within a stage, which can limit responses over time to

When we fit each of the models to our neural data, they were all able to produce an increase in selectivity from the first stage of the network to the second. This increase is not surprising because all models had converging feedforward connections from the first to second stages (Figure 5a, first five columns, compare green and black curves). However, we found that neither the lateral inhibition model nor the normalization model could capture the observed selectivity reversal phenomenon in pIT. Instead, the selectivity of these models simply increased to a saturation level set by the leak term (shunting inhibition) in the system (Figure 5a, first five columns). Similar behavior was present when we tried a nonlinear implementation of the normalization model that more powerfully modulated shunting inhibition (23). That the normalization models performed poorly can be explained by the fact that responses to a strong stimulus, even when normalized, can meet but not fall below those to a
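This saturation behavior can be illustrated with a toy version of the within-stage inhibition models (the specific inhibitory form and all parameters here are assumptions for illustration, not the paper's fitted equations):

```python
import numpy as np

def simulate(u, k=0.5, tau=10.0, dt=1.0, T=300):
    # hidden-stage dynamics with an inhibitory term that scales with
    # summed within-stage activity (lateral inhibition / normalization)
    x = np.zeros(2)
    trace = []
    for _ in range(T):
        inhibition = k * x.sum()        # grows with total stage activity
        x = x + (dt / tau) * (-x + u - inhibition)
        trace.append(x.sum())
    return np.array(trace)

r_typical  = simulate(np.array([1.0, 1.0]))  # strong ("face-like") drive
r_atypical = simulate(np.array([0.6, 0.6]))  # weaker drive
selectivity = r_typical - r_atypical

# selectivity rises to a saturation level but never goes negative:
# inhibition can pull a strong response down toward a weaker one,
# but not below it, so no preference reversal can emerge
```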

In contrast to the above models, we found that the feedback model capable of computing

Finally, we asked whether our results generalized to larger networks of increasing depth.

We found similar results for three-layer versions of the models described above. Specifically, the dynamics of error signals in a three-layer model produced a good match to our data collected from three successive cortical areas (Fig. 5a, seventh column), while state signals in three-layer model networks did not produce the observed IT face selectivity dynamics (Fig. 5b,

properties (color, spatial frequency, contrast). To test whether our network displayed these different dynamical behaviors, we simulated familiar inputs as those that match the learned weight pattern of a high-level detector and novel inputs as those with the same overall input level but with weak correlation to the learned network weights (here, we have extended the network to include two units in the output stage corresponding to storage of the two familiarized input patterns to be alternated; conceptually, we consider these familiar pattern detectors as existing downstream of IT in a region such as perirhinal cortex, which has been shown to code familiarized image statistics and memory-based object signals (39)). We repeatedly alternated two familiar inputs or two novel inputs and found that model responses in the hidden processing stage were temporally sharper for familiar inputs that matched the learned weight patterns compared to novel, unlearned patterns of input, consistent with the previously observed phenomenon (Fig. 6b; data reproduced with permission from Meyer et al., 2014 (13)). Model responses reproduced additional details of the neural dynamics, including a large initial peak followed by smaller peaks for responses to novel inputs and a phase delay in the oscillations of responses to novel inputs compared to familiar inputs. Intuitively, these dynamics are composed of two phases.
After the initial response transient, familiar patterns lead to lower errors and hence lower neural responses than random patterns (see Fig. 6b, red curve drops below the blue curve after the onset response), similar to the observed weaker response to more familiar face-like images present in our data (Fig. 2d). When the familiar pattern A is switched to another familiar pattern B, this induces a short-term error in adjusting to the new pattern (Fig. 6b, red curve briefly goes above the blue curve during the pattern switch and then decreases).
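A stripped-down version of this simulation (with assumed parameters, a single 4-unit hidden stage, and one learned detector rather than the paper's alternating pair) shows the core effect: settled errors are smaller for inputs that match the learned weights:

```python
import numpy as np

# one output detector with learned weights w sits above a 4-unit hidden
# stage; error units e1 = x - w*y carry what the detector fails to
# explain. All parameter values are illustrative, not fitted.
w = np.array([0.5, 0.5, 0.5, 0.5])      # learned weights (unit norm)

def settle(u, tau=5.0, dt=1.0, T=400):
    x = np.zeros(4)
    y = 0.0
    for _ in range(T):
        e0 = u - x                      # input-stage prediction error
        e1 = x - w * y                  # hidden-stage prediction error
        x = x + (dt / tau) * (e0 - e1)  # gradient descent on the coding cost
        y = y + (dt / tau) * (w @ e1 - y)
    return np.linalg.norm(x - w * y)    # residual hidden-stage error

familiar = settle(w)                               # input matches learned weights
novel = settle(np.array([0.5, -0.5, 0.5, -0.5]))   # same norm, orthogonal to w

# the settled error (and hence the simulated late response) is smaller
# for the familiar input than for the novel one
```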

Because unfamiliar patterns are closer together in the high-level encoding space than two learned patterns (Fig. 6b,

evolve to not prefer typical face part arrangements. This behavior was inconsistent with a pure feedforward model, even when we included strong nonlinearities in these models, such as normalization. However, we showed that augmenting the feedforward model so that it represents the errors generated during hierarchical processing produced the observed neural dynamics (Fig. 5). This view argues that a fraction of cortical neurons codes error signals. Using this new modeling perspective, we went on to generate a series of predictions consistent with observed IT neural phenomena (Fig. 6). Importantly, this perspective provides an alternative

The precise fractional contribution of errors to neural activity is difficult to estimate from our data. Under the primary image condition tested, not all sites significantly decreased their selectivity (~60%). We currently interpret these sites as coding state (feature) estimates (Fig. 2c, gray and black dots), and we did observe evidence of emergence of state-like signals in our superficial neural recordings (Fig. 6c). Alternatively, at least some of the non-reversing sites might be found to code errors under other image conditions than the one that we tested.

Furthermore, while in our primary image condition selectivity reversals only accounted for 20% of the overall spiking modulation (Fig. 2d), we found larger modulations in late phase neural firing (50-100%) under other image conditions tested (Fig. 6a,b). At a computational level, the absolute contribution of error signals to spiking may not be the critical factor, as even a small relative contribution may have important consequences in the network.

Error signals generated across different hierarchical inference and learning models

The notion of error is inherent to many existing models in the literature that go beyond the basic

Finally, recent models incorporate aspects of both inference and learning (42)(43) (Fig. 7, bottom two rows). A key, unifying feature across inference and learning models is the need to compute an error signal between processing stages. This error signal can be in the form of a generative, reconstruction cost (stage n predicting stage n-1) or a discriminative, construction cost (stage n-1 predicting stage n). Regardless, this across-stage "performance" error term is used in all models, is typically the only term combining signals from different model layers, and is distinct from within-stage "regularization" terms (e.g. sparseness or weight decay) in driving network behavior. The present study provides evidence that such errors are not only computed, but that they are explicitly encoded in spiking rates. To test the robustness of this claim across different model implementations, we tested models with different performance errors (reconstruction, nonlinear reconstruction, and discriminative) and found similar population level error signals across these networks (Supplementary Fig. 2). Thus, errors as broadly construed
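The two error forms can be made concrete in a toy two-stage setting (the weights and activity vectors below are illustrative stand-ins, not fitted quantities):

```python
import numpy as np

W = np.array([[1.0, 1.0]])      # 1x2 feedforward weights, stage n-1 -> n
x_lower = np.array([1.0, 0.5])  # stage n-1 activity
x_upper = W @ x_lower           # stage n activity (pure feedforward here)

# generative / reconstruction error: stage n predicts stage n-1
e_reconstruction = x_lower - W.T @ x_upper

# discriminative / construction error: stage n-1 predicts stage n
e_discriminative = x_upper - W @ x_lower

# note the asymmetry: even when the discriminative error vanishes
# exactly (x_upper is built from x_lower), the reconstruction error
# need not, because the low-dimensional stage n cannot fully
# reconstruct the higher-dimensional stage n-1 pattern
```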

(~200 ms). By looking more closely at the fine timescale dynamics of the IT response, we suggest that this same "extreme coding" phenomenon can instead be interpreted as a natural consequence of networks that have an actual tuning preference for typical faces (as evidenced by an initial response preference for typical faces in pIT, cIT, and aIT; Fig. 4b) but that also compute error signals with respect to that preference. The hierarchical error coding framework proposed here provides a single, unifying account of many other reliable but previously unexplained phenomena in IT: sublinear integration of multiple inputs (35)(28) (Fig. 6a), larger responses to first presentations of novel inputs (Fig. 6b, response to first presentation is larger for the novel inputs), and rapid response dynamics for familiar over novel images (13) (Fig. 6b,

(Fig. 3c), and all images were presented for 100 ms duration

Neural data analysis. The face patches were physiologically defined in the same manner as in our previous study (28). Briefly, we fit a graded 3D sphere model (linear profile of selectivity that rises from a baseline value toward the maximum at the center of the sphere) to the spatial profile of face versus nonface object selectivity across our sites. We tested spherical regions with radii from 1.5 to 10 mm and center positions within a 5 mm radius of the fMRI-based centers of the face patches. The resulting physiologically defined regions were 1.5 to 3 mm in diameter. Sites which passed a visual response screen (mean response in a 60-160 ms window >2*SEM above baseline for at least one of the four categories in the screen set) were included in further analysis. All firing rates were baseline subtracted using the activity in a 25-50 ms window following image onset, averaged across all repetitions of an image. Finally, given that the visual response latencies in monkey 2 were on average 13 ms slower than those in monkey 1, we applied a single latency correction (13 ms shift to align monkey 1 and monkey 2's data) prior to averaging across monkeys. This was done so as not to wash out any fine timescale dynamics by averaging, though similar results were obtained without using this latency correction, and this single absolute adjustment was more straightforward than the site-by-site adjustment used in our previous work (similar results were obtained using this alternative latency correction) (28). The observed selectivity dynamics (Fig. 2) were found in each monkey analyzed separately (Fig. 3a). Images that produced an average population response >0.9 of the initial response (60-100 ms) to a face-like image were analyzed further (Figs. 2-4). In follow-

Fig. 1). Here, we provide a basic description for each model tested.
All models utilize a 2x2 feedforward identity matrix A that simply transfers inputs u (2x1) to hidden layer units x (2x1), and a 1x2 feedforward matrix B that integrates hidden layer activations x into a single output unit y.
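In code, this shared feedforward skeleton is simply the following (the weights of 1.0 in B are placeholders; the fitted values are not given in this excerpt):

```python
import numpy as np

A = np.eye(2)                   # 2x2 identity: transfers u to hidden units x
B = np.array([[1.0, 1.0]])      # 1x2 convergence: hidden units x to output y

u = np.array([[0.5], [0.5]])    # example input (2x1)
x = A @ u                       # hidden layer activations (2x1)
y = B @ x                       # single output unit (1x1)
```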

To generate dynamics in the simple networks below, we assumed that neurons act as leaky

Normalization. An inhibitory term that scales with the summed activity of units within a stage is added to each unit's dynamics (equation (4)).

Since the normalization term in equation (5) is not continuously differentiable, we used the fourth-order Taylor approximation around zero in the simulations of equation (5).

Feedback (linear reconstruction). The feedback-based model is derived using a normative framework that performs optimal inference in the linear case (14), minimizing a squared-error coding cost across stages:

C = (1/2)||u - A^T x||^2 + (1/2)||x - B^T y||^2 + (1/2)||y||^2 (6)

Differentiating this coding cost with respect to the encoding variables in each layer x, y yields:

∂C/∂x = -A(u - A^T x) + (x - B^T y)
∂C/∂y = -B(x - B^T y) + y (7)

The cost function C can be minimized by descending these gradients over time to optimize the encoding:

τ dx/dt = -∂C/∂x, τ dy/dt = -∂C/∂y (9)

Error signals computed in the feedback model. In equation (9), the between-stage reconstruction error terms provide both the dynamics for online inference (equation (10)) and gradients for offline learning (dynamics in weight space).
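A minimal simulation of these error dynamics (with assumed time constants, and a slower output stage so that feedback builds up after the feedforward response) reproduces the qualitative pIT reversal: the well-matched ("typical") input initially drives the larger hidden-stage error, which feedback then explains away:

```python
import numpy as np

# b plays the role of B^T: the output unit's preferred conjunction of
# local-part and global-arrangement evidence. A "typical" image supplies
# both kinds of evidence; an "atypical" image drives a comparably strong
# but mismatched input. Hidden-stage ("pIT") activity is modeled as the
# magnitude of the error e1. All parameter values are illustrative.
b = np.array([1.0, 1.0]) / np.sqrt(2)

def error_trace(u, tau_x=5.0, tau_y=20.0, dt=1.0, T=400):
    x = np.zeros(2)
    y = 0.0
    trace = []
    for _ in range(T):
        e1 = x - b * y                     # hidden-stage error
        trace.append(np.linalg.norm(e1))
        e0 = u - x                         # input-stage error
        x = x + (dt / tau_x) * (e0 - e1)   # gradient descent on the cost
        y = y + (dt / tau_y) * (b @ e1 - y)
    return np.array(trace)

err_typical  = error_trace(np.array([0.75, 0.75]))   # matches b
err_atypical = error_trace(np.array([0.70, -0.70]))  # mismatched, similar drive

# early phase: the typical input drives the larger error response;
# late phase: feedback explains the typical input away and the
# preference reverses, as observed in pIT
```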

In order for the reconstruction errors at each layer to be scaled appropriately in the feedback model, we invoke an additional downstream variable z to predict activity at the top layer such that, instead of e_2 = y, which scales as a state variable, we have e_2 = y - C^T z (Supplementary Fig. 1a). This overall model reflects a state and error coding model as

Feedback (three-stage). For the simulations in Figs. 5 and 6, a three-stage version of the above models was used. These deeper networks were also wider, such that they began with four input units (u) instead of only two inputs in the two-stage models. These inputs converged through successive processing stages (w, x, y) to one unit at the top node (z) (Supplementary Fig. 1b).

Feedback (nonlinear reconstruction). We tested versions of feedback-based models that optimized cost functions other than a linear reconstruction cost (Supplementary Fig. 2). In nonlinear hierarchical inference, reconstruction is performed using a monotonic nonlinearity with a threshold (th) and bias (bi): (12)

Model parameter fits to neural data. In fitting the models to the observed neural dynamics, we mapped the summed activity in the hidden stage (x) to population averaged activity in pIT, and we mapped the summed activity in the output stage (y) to population averaged signals measured in aIT. To simulate error coding, we mapped the reconstruction errors e_1 = x - B^T y and e_2 = y - C^T z to activity in pIT and aIT, respectively. We applied a squaring nonlinearity to the model outputs as an approximation to rectification, since recorded extracellular firing rates are non-

Parameter values were fit in a two-step procedure. In the first step, we fit only the difference in response between image classes (the differential mode, which is the selectivity profile

For Fig. 6b, we approximated novel versus familiar images as random patterns versus structured input patterns that matched the learned weights of the network. Here, we used a version of the model with two independent outputs reflecting detectors for two familiarized input patterns (output 1 tuned to pattern A: u_1, u_2, u_3, u_4 active; output 2 tuned to pattern B: u_5, u_6, u_7, u_8 active) (Fig. 6b). Alternating between these two input patterns simulates alternation of two

McGovern Institute for Brain Research.

Figure 1 Neurophysiological recordings of face-selective subregions in the ventral visual stream. The ventral visual stream is a series of hierarchically connected areas (diagram in top row) and includes at least three IT processing stages (blue box). Neurons in these three stages were recorded along the lateral convexity of the inferior temporal lobe spanning the posterior to anterior extent of IT (+0 to +20 mm AP, Horsley-Clarke coordinates) in two monkeys (data from monkey 1 are shown). "Face neuron" sites (red) were operationally defined as those with a response preference for images of faces versus images of nonface objects (see SI Methods). While these were found throughout IT, they tended to be found in clusters that mapped to previously identified subdivisions of IT (posterior, central, and anterior IT) and also corresponded to face-selective areas identified under fMRI in the same subject (28)(48) (STS = superior temporal sulcus, IOS = inferior occipital sulcus, OTS = occipitotemporal sulcus).

Supplementary Fig. 1a,b) were constructed to model neural signals measured in pIT and aIT corresponding to the first (green) and second (black) model processing stages. All models received two inputs (gray) into two hidden stage units (green) which sent feedforward projections that converged onto a single unit in the output stage (black).
Besides this feedforward architecture, additional excitatory and inhibitory connections between units were used to implement recurrent dynamics (self connections reflecting leak currents are not shown here for clarity; see Supplementary Fig. 1a for detailed diagrams). In the five models on the left, the responses of the simulated neurons are assumed to code the current estimates of some set of features in the world (a.k.a. states), as is standard in most such networks. The best fits of the states of each model class to the population averaged neural data in (b) (same as left panel in Fig. 4b) are shown (first five columns). These state coding models showed increasing selectivity over time from hidden to output layers and did not demonstrate the strong reversal of stimulus preference in their hidden processing stage (green lines) as observed in the pIT neural population.

(21)(40) is computationally similar to (vi), but it differs at the implementation level by specifically positing that error information (rather than state information) is passed to the next higher cortical stage. (b) Between-stage and within-stage connectivity diagrams corresponding to the models in (a). Between-stage errors (black circles), measuring reconstruction performance, are computed in a similar fashion across models and can drive efficient hierarchical learning when coupled with state signals (white circles) (bottom four networks). The state and error computing networks only differ in the details of how error signals and state signals interact during inference and learning. Our data provide evidence for this large family of error-computing networks and rule out pure state-estimating models and variants including normalization and lateral inhibition (Fig. 5). The present data do not distinguish between the autoencoder and error backpropagation classes when directly compared (Supplementary Fig.
2); however, the stronger presence of state-like signals in the superficial cortical layers (Fig. 6c) argues against the predictive coding models in (iii) and (vii).