Enhancing the Wisdom of the Crowd With Cognitive Process Diversity: The Benefits of Aggregating Intuitive and Analytical Judgments

Drawing on dual process theory, we suggest that the benefits that arise from combining different individual judgments will be heightened when these judgments are based on different cognitive processes. We test this hypothesis in two experimental studies in which participants were prompted to make judgments relying on an analytical process, on their intuition, or in a control condition in which no particular instructions were given. Our results show that an aggregation of intuitive and analytical judgments provides more accurate estimates than any other aggregation procedure, and that this advantage increases with the number of judgments that are aggregated. Moreover, we find evidence that this result is driven by a lower correlation between errors of intuitive and analytical judgments compared to errors of judgments that are based on the same cognitive process or judgment errors in the control condition.

Building on this view, we do not aim to establish the relative advantages of one cognitive process over the other with respect to their effect on judgment accuracy. Rather, we propose that because the two systems rely at least partially on different information and mechanisms to form judgments, these two cognitive processes can be expected to produce judgments (and hence judgment errors) that are less systematically correlated with each other than judgments that result from the predominant use of only an analytical or only an intuitive cognitive process. Since, as previously discussed, the benefits of judgment aggregation depend strongly on the level of independence among individual judgment errors, this directly implies that aggregating judgments from two different types of cognitive processes (that is, forming crowds with a high level of cognitive process diversity) should be superior to forming less diverse crowds by aggregating judgments of the same type. This line of reasoning is summarized in Hypothesis 1.

Hypothesis 1:
Crowds with a high level of cognitive process diversity formed by aggregating intuitive and analytical judgments will be more accurate than less diverse crowds formed by aggregating only analytical or only intuitive judgments.
Importantly, we also expect that the effect of cognitive process diversity on crowd accuracy will depend on the size of the crowd. In general, as we outlined previously, the benefits that arise from aggregating individual judgments depend strongly on the extent to which these judgments bracket the true value. Bracketing in turn is more likely to happen when individual judgments are less correlated with each other or when the crowd is larger, but the relationship between these three factors is relatively intricate [1]. In particular, in very small crowds, even if individual judgments were completely independent, there are so few of them that they will often not be evenly distributed on both sides of the true value. That is, these judgments will not "bracket" the true value, and the benefits arising from judgment aggregation will be relatively small. However, as the crowd gets larger, the positive effect of judgment independence on crowd accuracy is more likely to be fully realized. This is because in large crowds, if judgments have a high level of independence, due to the law of large numbers, they will be distributed quite evenly around the true value, and therefore bracketing will occur much more frequently than in small crowds. Based on this line of reasoning, we therefore expect that the positive effect of cognitive process diversity on crowd accuracy, which arises from a higher level of independence between individual judgment errors, will gradually increase as the crowd gets larger.
[1] In addition, the extent of "bracketing" and hence the benefits of aggregation might also be affected by each judge's bias across judgments, that is, a judge's tendency to systematically under- or overestimate the true values across a number of judgments (Davis-Stober, Budescu, Dana, & Broomell, 2014). We provide a discussion of this factor in the appendix.
Overall, this line of reasoning is also strongly consistent with the theoretical results of Lamberson and Page (2012) and Davis-Stober and colleagues, who formally show that the relative effect of independence (as measured by the covariance of individual errors) on judgment accuracy increases with crowd size. We thus predict:
Hypothesis 2:
The benefits arising from cognitive process diversity will be greater in larger than in smaller crowds.
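The intuition behind Hypothesis 2 can be illustrated with a stylized calculation. This is a simplifying sketch and not the formal model of the cited papers: it assumes n unbiased judges whose errors e_i share a common variance \sigma^2 and a common pairwise correlation \rho, and it uses squared error rather than the absolute deviation measure used in the studies below.

```latex
\mathbb{E}\left[\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)^{2}\right]
  = \frac{\sigma^{2}}{n} + \frac{n-1}{n}\,\rho\,\sigma^{2},
\qquad
\Delta(n) = \frac{n-1}{n}\,\bigl(\rho_{\text{high}} - \rho_{\text{low}}\bigr)\,\sigma^{2}
```

Here \Delta(n) is the difference in expected squared error between a crowd with error correlation \rho_{\text{high}} and one with \rho_{\text{low}}. Under these assumptions the advantage of the less correlated crowd is zero for n = 1, grows with n, and levels off at (\rho_{\text{high}} - \rho_{\text{low}})\sigma^2 for very large crowds.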
In the following we present the results of two experimental studies designed to test these hypotheses. For both studies we will focus on a general presentation of our measures and main results, and provide a more detailed description of our measures and additional results in the appendix.

Study 1 Design and procedure
Electronic copy available at: https://ssrn.com/abstract=3319676
We recruited 158 participants (90 women and 68 men; M_age = 24 years) at a European university for a laboratory experiment. Six participants did not follow the instructions accurately and were removed from the sample, resulting in a final sample size of 152. Participants were randomly assigned to one of three conditions: analytical (n = 48), intuitive (n = 51), or control (n = 53).
In all conditions, participants were placed at an individual computer where they provided answers to 40 questions about the dates of historical events [2]. Each participant received a fixed payment of €6 and could win an additional €6 based on their performance during the study, which was calculated from the absolute deviations of their judgments from the true values. Following prior research (e.g., Dane et al., 2012), in the intuitive condition we instructed participants to base their decisions entirely on their intuition, and to avoid consciously thinking about what the right answer is. In addition, participants were given only seven seconds to answer each of the questions. In contrast, to induce analytical judgments, we instructed participants to think carefully about the particular reasons for their judgment and to ignore any first impressions or "gut instincts" that might arise. Finally, in the control condition participants were also given unlimited time, but were not provided with any specific instructions on how to make judgments.

Measures
Crowd accuracy. Crowds were created by randomly selecting (with replacement) individual judges and averaging their judgments. We formed four different crowd types by drawing judges only from the analytical condition (ANL), only from the intuitive condition (INT), only from the control condition (CON), or equally from the analytical and the intuitive condition (ANL-INT). For all these crowd types, we also varied the size of the crowd, letting it range from 1 to 50. For example, to form an ANL crowd of size ten, we randomly drew ten judges with replacement from the analytical condition, whereas to form an ANL-INT crowd of the same size, five judges were drawn from the analytical and five from the intuitive condition.
[2] For our analysis, statistical power is determined by the number of questions rather than the number of participants in each condition (e.g., Budescu & Chen, 2014; Palley & Soll, 2018; Mannes et al., 2014). For paired t-tests (with an assumed correlation of 0.5), a total of 40 questions provides a satisfactory statistical power of 0.87 for a medium effect size of 0.5 (e.g., Mannes et al., 2014, observed an effect size of 0.51 for a related manipulation to improve crowd wisdom). The number of participants was determined such that we would get a reasonably good approximation of the population parameters for each question. Assuming a standard deviation of 200 (which we observed in an unrelated prior study with similar knowledge items), a sample size of 50 per condition allows us to estimate the true population mean with 95% confidence within an error of ±55 years.
For each crowd, we calculated the mean of individual judgments for each question and determined the corresponding judgment accuracy, defined as the absolute deviation from the true value (e.g., de Oliveira & Nisbett, 2018; Minson, Mueller, & Larrick, 2018; Palley & Soll, 2018). We repeated this procedure 1000 times for each crowd type and size and averaged across the 1000 trials.
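The resampling procedure described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `crowd_errors`, the data layout (one list of estimates per judge) and the toy numbers are assumptions made for the example.

```python
import random
from statistics import mean


def crowd_errors(pools, size, truths, trials=1000, seed=0):
    """Return the mean absolute error per question for simulated crowds.

    pools  -- list of judge pools (one per condition); each judge is a list
              holding one estimate per question
    size   -- crowd size; members are split equally across the pools
    truths -- true value for each question
    """
    if size % len(pools) != 0:
        raise ValueError("size must be divisible by the number of pools")
    rng = random.Random(seed)
    per_pool = size // len(pools)
    totals = [0.0] * len(truths)
    for _ in range(trials):
        # Draw judges with replacement, the same number from every pool.
        crowd = [rng.choice(pool) for pool in pools for _ in range(per_pool)]
        for q, truth in enumerate(truths):
            crowd_estimate = mean(judge[q] for judge in crowd)
            totals[q] += abs(crowd_estimate - truth)
    # Average the absolute deviations across trials, question by question.
    return [total / trials for total in totals]


# Toy data (invented for illustration): two questions, two judges per pool.
truths = [1900, 1800]
analytical = [[1910, 1830], [1920, 1790]]
intuitive = [[1880, 1780], [1895, 1815]]

anl = crowd_errors([analytical], size=10, truths=truths)
anl_int = crowd_errors([analytical, intuitive], size=10, truths=truths)
print(anl, anl_int)
```

Returning one value per question, rather than a single grand mean, keeps the output in the form needed for paired comparisons across questions.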

Judgment error independence.
To directly examine the level of independence between individual judgment errors within a crowd (e.g., Lamberson & Page, 2012), we measured the average pairwise correlation of signed errors of any two judges who were randomly drawn from either the same or from different conditions across 1000 trials.
Manipulation check and expertise. We adapted five items (α = 0.78) from Dane et al. (2012) to assess to what extent judges made their judgments using an intuitive or analytical process (e.g., "I based my judgments on my inner feelings and reactions"). Moreover, we employed three items (α = 0.91) to measure judges' expertise in historical events (e.g., "I know a lot about different historical events"). All items were assessed on a scale from 1 = "not at all" to 7 = "very much".

Results
Participants reported an average level of expertise in historical events (M = 3.77, SD = 1.40). In both studies, we also explored crowd judgments formed by aggregating individual judgments of only high or only low expertise, but did not find systematic differences from the results reported here. As expected, we found that participants relied significantly more on an [...]
Table 1 shows crowd judgment accuracy aggregated over all questions. We found that across all crowd types even small crowds of size two provided more accurate judgments than individual judges. Moreover, judgment accuracy increased with crowd size, but once crowds were larger than 20, increasing crowd size further had only a very small effect.
We next compared the judgment accuracy of ANL-INT crowds to that of all other crowd types using paired t-tests. In both studies we also used non-parametric Wilcoxon signed-rank tests and obtained results consistent with those reported here. Table 2 presents the results of this comparison as well as pairwise comparisons of the accuracy of ANL, INT, and CON crowds.
We did not find significant differences in judgment accuracy between randomly selected individual judges across different conditions, nor between ANL, INT, or CON crowds of any size. In contrast, providing support for Hypothesis 1, for all crowd sizes, ANL-INT crowds were significantly more accurate than ANL, INT, or CON crowds. To test the prediction of Hypothesis 2 that the benefits of cognitive process diversity will be more pronounced in larger crowds, we computed the difference in accuracy between ANL-INT and other crowd types in small and large crowds. Supporting Hypothesis 2, we found that the difference in judgment accuracy between ANL-INT and INT crowds, t(39) = 2.13, p = .039, d = 0.25, between ANL-INT and ANL crowds, t(39) = 3.88, p < .001, d = 0.46, and between ANL-INT and CON crowds, t(39) = 4.30, p < .001, d = 0.25, was significantly larger in crowds of size 50 than in crowds of size two.
In line with these results, we also found that the average pairwise correlation of signed errors was r̄ = 0.18 for judges drawn only from the analytical condition, r̄ = 0.10 for judges drawn only from the intuitive condition, and r̄ = 0.19 for judges from the control condition, but only r̄ = −0.03 for judges drawn equally from the intuitive and analytical conditions.

Study 2 Design and procedure
Participants in Study 2 were randomly assigned to one of three conditions (intuitive, n = 34; analytical, n = 33; and control, n = 31) and asked to estimate the probabilities of the three possible outcomes (team 1 wins, draw, team 2 wins) for all 48 matches in the group stage of the 2018 soccer World Cup.
We recruited 98 participants (51 women and 47 men; M_age = 26 years) from a European university [4], who completed the study online. Participants were rewarded with course credit, and had the opportunity to win an Amazon voucher worth up to €40 depending on their performance, which was assessed based on a quadratic scoring rule once the World Cup was over. For each match, participants used sliders to enter their probability judgments for all three possible match outcomes; the sliders were programmed so that the stated probabilities always summed to 100%.
Our manipulation of participants' cognitive processes was very similar to that used in Study 1, except that participants in the intuitive condition now had 10 seconds to enter their estimates.

Measures
Crowd accuracy. As there is no objectively true value of each outcome's probability, following previous research, we used probabilities implied in the betting odds provided by sports betting companies as a benchmark (calculated as the normalized inverse of the provided odds; for details see the appendix) and computed the absolute deviation from this benchmark as our measure of judgment accuracy. Betting odds are among the best available predictors for sport events (e.g., Boulier, Stekler, & Amundson, 2006; Spann & Skiera, 2009). They thus constitute an upper bound for judgment accuracy of individuals without very specific expert knowledge, making them an adequate benchmark for the quality of crowd judgments (e.g., Herzog & Hertwig, 2011; Palley & Soll, 2018). Using the same general procedure as in Study 1, we created crowds of different types and sizes by randomly drawing participants from the corresponding conditions, averaging their judgments for each match and each outcome, computing the absolute deviation from the benchmark probabilities, and averaging the absolute deviation across all outcomes and all matches. We repeated this procedure 1000 times for each crowd type and size and then averaged across the 1000 trials.
[4] [...] points (estimated based on a small pilot study with 9 participants). Our statistical power is determined by the number of matches (48), which provided us with an acceptable power of 0.78 assuming an effect size of 0.4, which we observed in Study 1.

Judgment error independence.
We again computed the average pairwise correlation in signed errors between any two individual judges randomly drawn from either the same or different conditions following the same general procedure as described in Study 1.
Brier scores. In addition to our main measure of judgment accuracy, we also computed Brier scores (Brier, 1950) for all crowds based on the actual match outcomes. In particular, the Brier score for a particular match $m$ was calculated as: $BS_m = \sum_{o=1}^{3} (p_{m,o} - I_{m,o})^2$, where $p_{m,o}$ denotes the probability estimate of a particular outcome $o$ of match $m$, and $I_{m,o}$ is an indicator that equals one if the outcome of match $m$ was $o$ and zero otherwise.
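As a small worked example, the Brier score for a single match can be computed as below. The sketch assumes the multi-category form of Brier (1950), that is, the sum of squared differences over the three outcomes; whether the score was additionally divided by the number of outcomes is not stated here. The function name and the example forecast are illustrative.

```python
def brier_score(probabilities, outcome_index):
    """Sum of squared differences between forecast and outcome indicator.

    probabilities -- forecast for (team 1 wins, draw, team 2 wins)
    outcome_index -- position of the outcome that actually occurred
    """
    score = 0.0
    for index, probability in enumerate(probabilities):
        indicator = 1.0 if index == outcome_index else 0.0
        score += (probability - indicator) ** 2
    return score


# Hypothetical crowd forecast for one match, where team 1 won.
forecast = [0.5, 0.3, 0.2]
print(round(brier_score(forecast, 0), 2))  # 0.25 + 0.09 + 0.04 = 0.38
```

Lower scores indicate better forecasts: a perfect forecast scores 0 and a forecast that puts all probability on a wrong outcome scores 2.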
Manipulation check and expertise. Participants answered the same five questions (α = .79) asking for their cognitive process during the experiment as in Study 1. In addition, we employed three items (α = .87) to assess judges' expertise in professional soccer (e.g., "I am very interested in professional soccer") on a 1 = "not at all" to 7 = "very much" scale.
As shown in Table 3, for all crowd types we find that even small crowds outperform individual judgments, and that crowd accuracy increases with crowd size in a concave fashion.

Results
To test Hypothesis 1, we compared judgment accuracy of ANL-INT crowds with other crowd types using paired t-tests as shown in Table 4.
Our analysis did not reveal significant differences in the accuracy of individual judgments across conditions, or in the accuracy of ANL, INT, or CON crowds of any size. In contrast, as predicted in Hypothesis 1, we found that judgments by ANL-INT crowds were significantly more accurate than those by ANL, INT, or CON crowds. In line with Hypothesis 2, our comparison of crowds with the minimum size (two) to those with the maximum size (50) [...]

General Discussion
Our results from two experimental studies showed that forming crowds with a high level of cognitive process diversity, by aggregating a combination of intuitive and analytical individual judgments, improves the quality of crowd wisdom compared to crowds formed by an aggregation of only analytical judgments, only intuitive judgments, or judgments made in a control condition without specific manipulation of judges' cognitive processes. Effect sizes were mostly in the small to medium range depending on the crowd size. Moreover, we found that whereas the benefits of crowd cognitive process diversity generally held for both smaller and larger crowds (with the exception of Brier scores in Study 2), the magnitude of these benefits increases with crowd size and eventually approaches its maximum as crowds become very large.
Supporting the suggestion that the benefits of cognitive process diversity are driven by higher levels of judgment error independence, we also observed a lower average correlation in signed errors between judges employing an intuitive process and judges employing an analytical process than between judges relying on the same cognitive process or judges in the control condition.
Our findings contribute to previous work on the statistical aggregation of individual judgments and crowd wisdom. Previous research has suggested a number of procedures to increase the benefits that arise from judgment aggregation, for example through selecting individuals whose judgments are most beneficial for crowd accuracy (e.g., Budescu & Chen, 2014; Mannes et al., 2014), or through refining the statistical procedure used to aggregate individual judgments (e.g., Jose & Winkler, 2008; Palley & Soll, 2018). In contrast to this work, our approach focuses on increasing independence between individual judgment errors by manipulating the basic cognitive process that individual judges employ to form their judgments.
It thus also complements recent work by de Oliveira and Nisbett (2018) who investigated the possibility of improving crowd wisdom by amplifying the demographic diversity of crowds, and found that this approach was largely ineffective.
In addition, our findings add to the ongoing debate about the relative quality of intuitive and analytical judgments (e.g., Dane & Pratt, 2007; Dane et al., 2012; Hogarth, 2010; Phillips et al., 2016). Recognizing that both of these processes have particular advantages and drawbacks, previous theoretical work has already pointed out the importance of finding ways to incorporate both intuition and analytical thinking into decision making in order to reap the benefits of both (e.g., Dane & Pratt, 2007; Hogarth, 2010). Our findings demonstrate how this could be achieved relatively easily in the context of quantitative judgments by simply averaging judgments of individuals employing different cognitive processes.
One important limitation of our results is that we only considered quantitative judgments. Even though such judgments are clearly relevant in a variety of situations, in many other cases managers actually choose between a small number of discrete options (e.g., whether project A is superior to project B). It would be interesting to explore to what extent our approach of relying on a combination of intuitive and analytical judgments might also help to improve the performance of plurality rules that are frequently used in such cases to aggregate judgments (see, e.g., Hastie & Kameda, 2005). Another important issue that should be investigated in future research is the specific task domains in which our results are valid. For example, for tasks that require a very high level of information processing or the application of formal logic to arrive at a judgment, intuitive judgments might frequently be significantly less accurate than analytical ones.
In this case, even though adding an intuitive judgment to a crowd composed of analytical judgments might provide benefits due to its lower correlation in errors, it also has the downside of adding a highly inaccurate judgment and thus potentially decreasing overall crowd accuracy, especially when crowds are relatively small (Mannes et al., 2014). A final interesting avenue for future research would be to explore whether our approach to improving crowd wisdom might also help to increase the effectiveness of combining judgments that are made by the same individual (e.g., Herzog & Hertwig, 2009; Vul & Pashler, 2008). This could, for example, be achieved by asking a judge to first make a fast intuitive judgment followed by an analytical judgment with unlimited time, and then averaging both judgments.

Author contributions
Both authors developed the study idea and the design of the laboratory experiments. Data collection and analysis were carried out by S. Keck with input from W. Tang. Both authors jointly wrote the manuscript and approved the final version of the manuscript for submission.
Tables
Table 1. Absolute deviations across crowd types and sizes (Study 1). Note: *Randomly selected individual judgments.
Table 3. Absolute deviations across crowd types and sizes (Study 2). Note: *Randomly selected individual judgments.

Calculation of judgment independence
To compute the average correlation in judgment errors in a given trial $k$, we randomly picked two judges with replacement, both from the analytical judgment condition, both from the intuitive judgment condition, both from the control condition, or one each from the intuitive and the analytical condition, respectively, providing us with a measure for each of the four different crowd types. Denote the estimates of these two judges for question $i$ as $x^{k}_{1,i}$ and $x^{k}_{2,i}$, $i = 1, 2, \dots, 40$. We then computed the corresponding signed deviations as $x^{k}_{1,i} - T_i$ and $x^{k}_{2,i} - T_i$, where $T_i$ denotes the true value for question $i$, and then the correlation coefficient of these two sets of signed deviations over all 40 questions, denoted as $r_k$. We repeated this procedure 1000 times and computed the average pairwise correlation coefficient, i.e., $\bar{r} = \frac{1}{1000} \sum_{k=1}^{1000} r_k$.
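The calculation above can be sketched in a few lines. The function names, the data layout (one list of estimates per judge) and the toy values are assumptions made for illustration, and the Pearson correlation is written out by hand so that the example depends only on the standard library.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_error_correlation(pool_a, pool_b, truths, trials=1000, seed=0):
    """Average correlation of signed errors for randomly paired judges.

    Pass the same pool twice for a same-condition pair, or two different
    pools for a mixed (e.g., analytical and intuitive) pair.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        judge_a = rng.choice(pool_a)
        judge_b = rng.choice(pool_b)
        errors_a = [estimate - t for estimate, t in zip(judge_a, truths)]
        errors_b = [estimate - t for estimate, t in zip(judge_b, truths)]
        total += pearson(errors_a, errors_b)
    return total / trials


# Toy data (invented): three questions, one judge per pool.
truths = [1900, 1800, 1700]
pool_a = [[1910, 1820, 1730]]  # signed errors +10, +20, +30
pool_b = [[1890, 1780, 1670]]  # signed errors -10, -20, -30
print(mean_error_correlation(pool_a, pool_b, truths, trials=10))
```

Because judges are drawn with replacement, a same-condition trial can pair a judge with themselves, which contributes a correlation of one to the average.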
Another potential determinant of crowd judgment accuracy besides correlation across judges comes from each crowd member's individual level of bias, i.e., each individual's own particular tendency to systematically under- or overestimate the true value across knowledge items. As pointed out in prior work, if individual judges exhibit a high level of such bias, crowd wisdom will be strongly decreased even for large crowds (e.g., Davis-Stober et al., 2015). To explore the effect of this factor across conditions, we computed a judge's bias in a given condition as: ∑ [...]. Note that this variable measures the level of bias of each individual across all questions, and is conceptually different from the commonly used standardized bias score (Einhorn et al., 1977).

Descriptive results, measures and additional analysis for Study 2
Items to measure expertise (i) "I am very interested in professional soccer" (ii) "I know a lot about different national soccer teams" (iii) "I spend a lot of time watching international soccer games such as the world cup".

Calculation of betting odds
During the same two weeks before the start of the World Cup in which we conducted our data collection, we obtained the decimal betting odds for the three possible outcomes of each game (team 1 wins, draw, and team 2 wins) from the websites of two major online sports betting firms, "Pinnacle" and "Betways". We then averaged the odds provided on the two respective websites and computed the probabilities (rounded to the next integer value) for each outcome implied in these odds. In order for sports betting firms to be profitable, the inverse of their provided odds should be equal to the probability of a particular outcome plus an additional profit margin called "overround", which for our two data sources is typically around 3-5%.
Thus, to obtain the implied probabilities we first calculated the inverse of the decimal odds, and then normalized the three implied probabilities by dividing each probability by the sum of the three probabilities. For example, suppose that the odds provided by the betting company for a match are 2.9, 3.6, and 2.4 for the three possible outcomes. In this case, we would calculate the corresponding inverse of the decimal odds as 0.34, 0.28, and 0.42, and then divide each probability by 1.04 to reach the normalized probabilities 0.33, 0.27, and 0.40.
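The normalization step can be reproduced with a short sketch. The function name is illustrative, and the input is assumed to be the decimal odds for one match after averaging across the two betting firms.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for one match into normalized probabilities."""
    inverse = [1.0 / odds for odds in decimal_odds]
    # The inverses sum to more than 1; the excess is the overround.
    total = sum(inverse)
    return [value / total for value in inverse]


odds = [2.9, 3.6, 2.4]  # team 1 wins, draw, team 2 wins
probabilities = implied_probabilities(odds)
overround = sum(1.0 / o for o in odds) - 1.0

print([round(p, 2) for p in probabilities])  # [0.33, 0.27, 0.4]
print(round(overround, 3))                   # 0.039
```

For the odds in the example above, the overround of roughly 3.9% falls within the 3-5% range mentioned for the two data sources.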

Crowd judgment accuracy
We first formed simulated crowds of a given type in the same way as in Study 1. Denote $p^{k}_{1,m,o}, p^{k}_{2,m,o}, \dots, p^{k}_{n,m,o}$ as the individual probability estimates in the simulated crowd $k$ for match $m$ and outcome $o$. We computed the mean of these probability estimates and took it as the estimate for match $m$ and outcome $o$ from crowd $k$ that has a size of $n$, that is, $\bar{p}^{k}_{m,o} = (p^{k}_{1,m,o} + \cdots + p^{k}_{n,m,o}) / n$. The corresponding accuracy measure was then computed as the average absolute deviation from the probabilities implied in betting odds for match $m$ and outcome $o$ across all iterations: $AD_{m,o} = \frac{1}{1000} \sum_{k=1}^{1000} | \bar{p}^{k}_{m,o} - \mathrm{Bet}_{m,o} |$.

Judgment independence
In a given trial $k$, we randomly picked two judges, both from the analytical judgment condition, both from the intuitive judgment condition, both from the control condition, or one from the analytical and one from the intuitive condition, whose probability estimates for match $m$ and outcome $o$ are denoted as $p^{k}_{1,m,o}$ and $p^{k}_{2,m,o}$. We then computed the corresponding signed deviations as $p^{k}_{1,m,o} - \mathrm{Bet}_{m,o}$ and $p^{k}_{2,m,o} - \mathrm{Bet}_{m,o}$, and then the correlation coefficient of these two sets of signed deviations over all 48 matches and 3 outcomes per match, denoted as $r_k$. We repeated this procedure (with replacement) 1000 times and computed the average pairwise correlation coefficient, i.e., $\bar{r} = \frac{1}{1000} \sum_{k=1}^{1000} r_k$.

Analysis of individual level bias
We computed a judge's level of bias in a given condition as: [...]