Machine Learning and Prediction in Medicine — Beyond the Peak of Inflated Expectations
N Engl J Med 376;26, nejm.org, June 29, 2017


Big data, we have all heard, promise to transform health care with the widespread capture of electronic health records and high-volume data streams from sources ranging from insurance claims and registries to personal genomics and biosensors.1 Artificial-intelligence and machine-learning predictive algorithms, which can already automatically drive cars, recognize spoken language, and detect credit card fraud, are the keys to unlocking the data that can precisely inform real-time decisions. But in the "hype cycle" of emerging technologies, machine learning now rides atop the "peak of inflated expectations."2

Prediction is not new to medicine. From risk scores that guide anticoagulation (CHADS2) and the use of cholesterol medications (ASCVD) to risk stratification of patients in the intensive care unit (APACHE), data-driven clinical predictions are routine in medical practice. In combination with modern machine learning, clinical data sources enable us to rapidly generate prediction models for thousands of similar clinical questions. From early-warning systems for sepsis to superhuman imaging diagnostics, the potential applicability of these approaches is substantial.
Yet there are problems with real-world data sources. Whereas conventional approaches are largely based on data from cohorts that are carefully constructed to mitigate bias, emerging data sources are typically less structured, since they were designed to serve a different purpose (e.g., clinical care and billing). Issues ranging from patient self-selection to confounding by indication to inconsistent availability of outcome data can result in inadvertent bias, and even racial profiling, in machine predictions. Awareness of such challenges may keep the hype from outpacing the hope for how data analytics can improve medical decision making.
Machine-learning methods are particularly suited to predictions based on existing data, but precise predictions about the distant future are often fundamentally impossible. Prognosis models for HER2-positive breast cancer had to be inverted in the face of targeted therapies, and the predicted efficacy of influenza vaccination varies with disease prevalence and community immunization rates. Given that the practice of medicine is constantly evolving in response to new technology, epidemiology, and social phenomena, we will always be chasing a moving target.
The rise and fall of Google Flu remind us that forecasting an annual event on the basis of 1 year of data is effectively using only a single data point and thus runs into fundamental time-series problems.3 And if the future will not necessarily resemble the past, simply accumulating mass data over time has diminishing returns.
Research into decision-support algorithms that automatically learn inpatient medical practice patterns from electronic health records reveals that accumulating multiple years of historical data is worse than simply using the most recent year of data. When our goal is learning how medicine should be practiced in the future, the relevance of clinical data decays with an effective "half-life" of about 4 months.4 To assess the usefulness of prediction models, we must evaluate them not on their ability to recapitulate historical trends, but instead on their accuracy in predicting future events.
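As a concrete sketch of this decay, a 4-month half-life (the figure cited above) implies an exponential down-weighting of older training examples. The weighting function below is our illustration of that idea, not the method used in the cited study:

```python
def sample_weight(age_months: float, half_life_months: float = 4.0) -> float:
    """Relevance weight of a clinical training example, halving every half-life."""
    return 0.5 ** (age_months / half_life_months)

# Under a 4-month half-life, year-old data retains only about 12% of its relevance.
for age in (0, 4, 8, 12):
    print(f"{age:>2} months old -> weight {sample_weight(age):.3f}")
```

The practical implication is the one the text draws: a model trained on a large but stale archive can underperform one trained on a smaller, recent slice of the record.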
Although machine-learning algorithms can improve the accuracy of prediction over the use of conventional regression models by capturing complex, nonlinear relationships in the data, no amount of algorithmic finesse or computing power can squeeze out information that is not present. That's why clinical data alone have relatively limited predictive power for hospital readmissions that may have more to do with social determinants of health.
The apparent solution is to pile on greater varieties of data, including anything from sociodemographics to personal genomics to mobile-sensor readouts to a patient's credit history and Web-browsing logs. Incorporating the correct data stream can substantially improve predictions, but even with a deterministic (nonrandom) process, chaos theory explains why even simple nonlinear systems cannot be precisely predicted into the distant future. The so-called butterfly effect refers to the future's extreme sensitivity to initial conditions. Tiny variations, which seem dismissible as trivial rounding errors in measurements, can accumulate into massively different future events. Identical twins with the same observable demographic characteristics, lifestyle, medical care, and genetics necessarily generate the same predictions, yet they can still end up with completely different real outcomes.
Though no method can precisely predict the date you will die, for example, that level of precision is generally not necessary for predictions to be useful. By reframing complex phenomena in terms of limited multiple-choice questions (e.g., Will you have a heart attack within 10 years? Are you more or less likely than average to end up back in the hospital within 30 days?), predictive algorithms can operate as diagnostic screening tests to stratify patient populations by risk and inform discrete decision making.
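A minimal sketch of this reframing, with hypothetical numbers (the 15% population average and the patient risks below are illustrative): a continuous risk estimate is collapsed into the discrete screening answer a decision actually needs.

```python
def readmission_flag(predicted_risk: float, population_average: float = 0.15) -> str:
    """Turn a continuous 30-day readmission probability into a discrete answer."""
    if predicted_risk > population_average:
        return "more likely than average"
    return "not more likely than average"

# Hypothetical patients with model-estimated 30-day readmission risks.
for risk in (0.07, 0.32):
    print(f"risk {risk:.0%}: {readmission_flag(risk)}")
```

The design choice mirrors a diagnostic screening test: the threshold, not the raw probability, determines who is flagged for an intervention.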
Research continues to improve the accuracy of clinical predictions, but even a perfectly calibrated prediction model may not translate into better clinical care.
An accurate prediction of a patient outcome does not tell us what to do if we want to change that outcome; in fact, we cannot even assume that it is possible to change the predicted outcomes.
Machine-learning approaches are powered by the identification of strong, but theory-free, associations in the data. Because of confounding, it is a substantial leap of causal inference from such associations to modifiable factors that will actually alter outcomes. It is true, for instance, that palliative care consults and norepinephrine infusions are highly predictive of patient death, but it would be irrational to conclude that stopping either will reduce mortality. Models accurately predict that a patient with heart failure, coronary artery disease, and renal failure is at high risk for postsurgical complications, but they offer no opportunity for reducing that risk (other than forgoing the surgery). Moreover, many such predictions are "highly accurate" mainly for cases whose likely outcome is already obvious to practicing clinicians. The last mile of clinical implementation thus ends up being the far more critical task of predicting events early enough for a relevant intervention to influence care decisions and outcomes.5

With machine learning situated at the peak of inflated expectations, we can soften a subsequent crash into a "trough of disillusionment"2 by fostering a stronger appreciation of the technology's capabilities and limitations. Before we hold computerized systems (or humans) up against an idealized and unrealizable standard of perfection, let our benchmark be the real-world standards of care whereby doctors grossly misestimate the positive predictive value of screening tests for rare diagnoses, routinely overestimate patient life expectancy by a factor of 3, and deliver care of widely varied intensity in the last 6 months of life.
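The misestimated positive predictive value mentioned above follows directly from Bayes' rule. With illustrative numbers (not taken from this article), even a good test is mostly wrong when the condition is rare:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value, P(disease | positive test), by Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# A test with 90% sensitivity and 95% specificity, screening for a
# condition with 1% prevalence: most positive results are false alarms.
print(f"PPV = {ppv(0.90, 0.95, 0.01):.1%}")  # roughly 15%
```

The same test applied to a high-prevalence population yields a high PPV, which is why prevalence, not test quality alone, governs how a positive result should be read.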
Although predictive algorithms cannot eliminate medical uncertainty, they already improve allocation of scarce health care resources, helping to avert hospitalization for patients with low-risk pulmonary embolism (PESI) and fairly prioritizing patients for liver transplantation by means of MELD scores. Early-warning systems that once would have taken years to create can now be rapidly developed and optimized from real-world data, just as deep-learning neural networks routinely yield state-of-the-art image-recognition capabilities previously thought to be impossible.
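As one example of such a prioritization score, the widely published MELD formula combines three laboratory values. The sketch below uses the standard coefficients; note that clamping and rounding conventions (e.g., the creatinine cap) vary across implementations, so treat this as an illustration rather than a clinical calculator:

```python
import math

def meld(creatinine: float, bilirubin: float, inr: float) -> float:
    """MELD score from serum creatinine (mg/dL), total bilirubin (mg/dL), and INR.

    Lab values below 1.0 are clamped to 1.0, a common convention that keeps
    the logarithms nonnegative.
    """
    cr = max(creatinine, 1.0)
    bili = max(bilirubin, 1.0)
    inr = max(inr, 1.0)
    return 9.57 * math.log(cr) + 3.78 * math.log(bili) + 11.2 * math.log(inr) + 6.43

# With all labs at the clamp floor, the score bottoms out at the constant term.
print(round(meld(1.0, 1.0, 1.0), 2))  # 6.43
```

Because the score is a transparent function of three labs, it can be audited and recomputed by anyone, which is part of what makes it workable as a fair allocation rule.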
Whether such artificial-intelligence systems are "smarter" than human practitioners makes for a stimulating debate, but it is largely irrelevant. Combining machine-learning software with the best human clinician "hardware" will permit delivery of care that outperforms what either can do alone. Let's move past the hype cycle and on to the "slope of enlightenment,"2 where we use every available information and data resource to consistently improve our collective health.

The health care market is undergoing rapid transformation, spurred in part by the Affordable Care Act (ACA) and recent payment reforms introduced by the Centers for Medicare and Medicaid Services (CMS). The industry is shifting from a business-to-business model involving insurers, health care providers, and pharmaceutical companies, which traditionally sheltered patients from financial and medical decisions, to a business-to-consumer model in which the patient (the consumer) is at the center of decision making. Innovations in health care technology are also rapidly expanding access to information and, in some cases, disinformation. Access to accurate and timely information empowers patients to make well-informed choices about their care. The centerpiece of this consumer-centric revolution in health care is shared decision making.
In December 2016, CMS announced a national pilot of the Beneficiary Engagement and Incentives Models, launched under the authority of Section 1115A of the Social Security Act, which was added by the ACA. These models employ shared-decision-making tools, including decision aids, and target preference-sensitive treatments such as joint replacement. CMS proposed two models: a Shared Decision Making Model and a Direct Decision Support Model. The Shared Decision Making Model will test an approach for integrating a structured four-step shared-decision-making process into clinical practice for clinicians in accountable care organizations. It's expected to engage more than 150,000 Medicare beneficiaries annually and will pay participating organizations $50 for each shared-decision-making service provided by their clinicians.
The Direct Decision Support Model, on the other hand, will target organizations that provide health management and decision support services. CMS will partner with up to seven organizations to support approximately 700,000 Medicare beneficiaries each year. Although the initiative's goals are to improve the quality of decision making and patient engagement in the care process, an implicit assumption is that well-informed patients might choose to receive less care, thereby reducing costs. It's also important to note that shared-decision-making tools aren't just about eliciting and honoring patient preferences; they could also help address the health care industry's market and regulatory failures, such as paying for unnecessary care.
It's not surprising that elective knee and hip replacement are among the preference-sensitive treatments targeted by CMS as part of its national effort to promote shared decision making. Osteoarthritis of the knee and hip is among the most prevalent chronic conditions in the United States. Joint replacement is one of the most successful surgical procedures in history, and the substantial evidence supporting its effectiveness and safety has made it one of the most commonly performed elective surgeries in elderly patients. Furthermore, use of joint replacement is projected to grow rapidly during the next decade as end-stage lower-extremity osteoarthritis, a pro-