Developing WHO guidelines: Time to formally include evidence from mathematical modelling studies

In recent years, the number of mathematical modelling studies has increased steeply. Many of the questions addressed in these studies are relevant to the development of World Health Organization (WHO) guidelines, but modelling studies are rarely formally included as part of the body of evidence. An expert consultation hosted by WHO, a survey of modellers and users of modelling studies, and literature reviews informed the development of recommendations on when and how to incorporate the results of modelling studies into WHO guidelines. In this article, we argue that modelling studies should routinely be considered in the process of developing WHO guidelines, but particularly in the evaluation of public health programmes, long-term effectiveness or comparative effectiveness. There should be a systematic and transparent approach to identifying relevant published models, and to commissioning new models. We believe that the inclusion of evidence from modelling studies into the Grading of Recommendations Assessment, Development and Evaluation (GRADE) process is possible and desirable, with relatively few adaptations. No single “one-size-fits-all” approach is appropriate to assess the quality of modelling studies. The concept of the ‘credibility’ of the model, which takes the conceptualization of the problem, model structure, input data, different dimensions of uncertainty, as well as transparency and validation into account, is more appropriate than ‘risk of bias’.

This article is included in the TDR gateway.

Introduction
Mathematical models have a long history in public health 1 . In recent years, the number of publications related to mathematical modelling has increased steeply. Today, mathematical modelling studies are not restricted to infectious diseases but address a wide range of questions.
The World Health Organization (WHO) provides recommendations on many public health, health system and clinical topics. WHO guidelines are developed using processes and methods that ensure the publication of high-quality recommendations, as outlined in the WHO Handbook for Guideline Development 2 . WHO uses the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach to rate the certainty of a body of evidence and to produce information that is used by guideline panels to formulate recommendations, based on the balance of benefits and harms and other considerations 3 .
Many of the questions addressed in mathematical modelling studies are relevant to the development of guidelines. Increasingly, WHO and other guideline developers need to decide whether and how the results of mathematical modelling studies should be included in the evidence base used to develop recommendations. We reviewed the 185 WHO guidelines that were approved by the Guidelines Review Committee from 2007 to 2015: 42 (23%) referred to mathematical modelling studies. However, these studies were rarely formally assessed as part of the body of evidence, and quality criteria for modelling studies were often lacking. A major barrier to the incorporation of evidence from mathematical modelling studies into guidelines is the perceived complexity of the methods used to construct and analyse these studies. At present, there are no widely agreed methods for, or approaches to, evaluating the results of mathematical modelling studies or integrating them with primary data to inform guidelines and recommendations. In April 2016, WHO organized a workshop in Geneva, Switzerland to discuss when and how to incorporate the results of modelling studies into WHO guidelines (see Acknowledgements for names of participants). Specifically, the workshop participants discussed the following three questions:
(1) When is it appropriate to consider modelling studies as part of the evidence that supports a guideline?
(2) How should the quality and risk of bias in mathematical modelling studies be assessed?
(3) How can the GRADE approach be adapted to assess the certainty of a body of evidence that includes the results of modelling and to formulate recommendations?
A detailed workshop report is available from WHO 4 .
The role of modelling in economic evaluation is well recognised in guideline development and at WHO, and was therefore excluded from discussions. At the workshop, we considered the results of a survey of experts (see Box 1) and a rapid literature review (see below). In this paper, which reflects the opinions of the authors but not necessarily those of all workshop participants, we first define models and modelling studies. We then address the three questions outlined above and conclude with some recommendations on the use of evidence from modelling studies in guideline development.

Amendments from Version 1
We have clarified and elaborated upon the distinctions between mathematical and statistical modelling and between a mathematical model and a mathematical modelling study. We use a broad definition of mathematical models which encompasses both descriptive and predictive aspects. Statistical modelling, on the other hand, typically characterizes sources of variation and associations between variables in observed populations of interest. We also elaborate on the GRADE domain of risk of bias as part of the assessment of certainty of a body of evidence for important and critical outcomes. We feel that the concept of risk of bias is too narrow in the context of mathematical modelling studies and prefer to use "credibility", which encompasses not only the risk of bias of the input data, but also conceptualization of the problem, model structure, other dimensions of uncertainty, transparency, and validation.

Box 1. Web-based expert survey on the role of mathematical modelling in guideline development
The survey was conducted between March 17 and April 4, 2016. It consisted of 10 questions: four on the characteristics of the respondents, three on the role of mathematical models in guideline development, two questions on quality criteria for mathematical models and one on the challenges in using mathematical modelling in guideline development (see Figure S1). Using snowball sampling, mathematical modellers, epidemiologists, guideline developers and other experts were invited to participate in the survey. A total of 151 individuals from 28 countries and 87 different institutions responded. About half of respondents were modellers, and the other half users of the results from modelling studies. The majority of respondents (58%) had been part of a guideline development group in the past.
Ninety-five percent of respondents answered yes to the question "Should mathematical modelling inform guidance for public health interventions?" and 60% indicated that findings of mathematical modelling studies can sometimes provide the same level of evidence as those of empirical research studies. When asked to list situations in which mathematical modelling could be particularly useful for the development of guidelines, the absence of empirical data on the effectiveness, cost-effectiveness and impact of an intervention, and on the comparative effectiveness of different interventions was most frequently mentioned. We also asked about situations where mathematical modelling studies should not be used or have been inappropriately used in the development of guidelines. Respondents reported that modelling should not be used "to cover up" for the lack of evidence from empirical research, and due emphasis should be given to the uncertainty of model predictions. When asked about the five most important criteria for the quality of reporting of modelling studies, respondents mentioned that the model structure should be clearly described and justified, the important sources of uncertainty reported, and model validity addressed. Assumptions should be clearly stated, justified and discussed and the sources of parameter estimates described. Finally, respondents identified the interpretation of results from modelling studies, the evaluation of their quality and the communication of uncertainty as major challenges in using mathematical modelling in guideline development. These challenges would be best addressed by including at least one modelling expert in guideline development groups.

What is a mathematical modelling study?
Using a common terminology across different disciplines, for example infectious disease modelling and modelling in chronic disease, will facilitate the assessment, evaluation and comparison of mathematical modelling studies. A broad definition of a mathematical model is a "mathematical framework representing variables and their interrelationships to describe observed phenomena or predict future events" 5 . We make a distinction between a mathematical model and mathematical modelling studies, which we define as studies that address defined research questions using mathematical modelling. Mathematical modelling studies typically address complex situations and tend to rely more heavily on assumptions about underlying mathematical structure than on individual-level data. Examples include investigating the potential of HIV testing with immediate antiretroviral therapy to reduce HIV transmission 6 , or the likely impact of different screening practices on the incidence of cervical cancer 7 .
Statistical modelling is typically concerned with characterizing sources of variation and associations between variables in observed individual-level data drawn from a target population of interest and tends to address questions of a narrower scope than mathematical models. Both statistical and mathematical models can be used to predict future outcomes and to compare different policies. The results from statistical analyses of empirical data often inform mathematical models. Mathematical modelling studies also increasingly integrate statistical models to relate the model output to data.
Workshop participants discussed whether it might be helpful for guideline groups to classify mathematical models in terms of their scope (for example descriptive versus predictive), or technical approach (for example static versus dynamic) 8 . Discussants argued that a good understanding of what information models can provide and what level of confidence can be placed in that information was more important than a detailed taxonomy of models 4 .

Role of mathematical modelling studies in guideline development
Mathematical models typically address questions that cannot easily be answered with randomized controlled trials (RCTs) or observational studies.

Examples of relevant mathematical modelling studies
The long-term effectiveness or cost-effectiveness of an intervention is unclear.
    Lifetime effect on decompensated cirrhosis of obeticholic acid as second-line treatment in primary biliary cholangitis 9 .
    Outcomes and costs over 10 years of donepezil treatment in mild to moderately severe Alzheimer's Disease 10 .
    Long-term clinical outcomes, costs and cost-effectiveness of interventions in diabetes mellitus (types 1 and 2) 11 .
The outcomes of an intervention in real-world, routine care settings are unclear.
    Outcomes of medical management of asymptomatic patients with carotid artery stenosis who were excluded from clinical trials 12 .
    Effects on blood pressure and cardiovascular risk of variations in patients' adherence to prescribed antihypertensive drugs 13 .

... settings, to long term outcomes, and to bridge the gap between efficacy and (long-term) effectiveness 23 . Second, interventions to prevent and control infectious diseases have non-linear effects. RCTs that address short term effects at the individual level might not be suitable for estimating the longer term effects of introducing an intervention, say a vaccine, in a whole population if indirect herd effects influence the incidence of infection and hence the impact of the intervention 24,25 . Third, rapid guidance is often needed early in outbreaks or public health emergencies when relevant interventions for prevention or management might simply not have been evaluated. The results of mathematical modelling studies can be used to draft emergency guidelines or to assess the epidemic potential of new outbreaks 26 .
The findings of mathematical modelling studies are only as good as the data and assumptions that inform them. Guideline recommendations should therefore not be based on the outputs of models when uncertainty in the empirical data has not been appropriately quantified, when the model makes implausible assumptions or has not been validated adequately, or when the model predictions vary widely over a plausible range of parameter estimates.
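The non-linear herd effects of vaccination, and the sensitivity of model outputs to parameter estimates, can be illustrated with a minimal sketch. It assumes a simple SIR epidemic in a closed, homogeneously mixing population and a perfectly protective vaccine; the function name, the basic reproduction number `r0` and all numerical values are illustrative assumptions and are not taken from any of the studies cited here.

```python
import math


def final_attack_rate(r0, coverage, tol=1e-12, max_iter=10_000):
    """Fraction of the whole population infected over an SIR epidemic.

    Assumes a perfectly protective vaccine given to `coverage` of the
    population before the epidemic starts (illustrative assumption).
    Solves the final-size relation z = s0 * (1 - exp(-r0 * z)) by
    fixed-point iteration, where s0 is the initially susceptible fraction.
    """
    s0 = 1.0 - coverage
    z = s0  # start from the upper bound: everyone susceptible is infected
    for _ in range(max_iter):
        z_next = s0 * (1.0 - math.exp(-r0 * z))
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


# Herd effect: compare the modelled attack rate with the attack rate expected
# if vaccination only protected the vaccinated individuals themselves.
unvaccinated = final_attack_rate(r0=2.0, coverage=0.0)
direct_only = (1.0 - 0.25) * unvaccinated
with_herd_effect = final_attack_rate(r0=2.0, coverage=0.25)
print(f"direct effect only: {direct_only:.3f}, modelled: {with_herd_effect:.3f}")

# Sensitivity: the same intervention evaluated over a range of r0 values.
for r0 in (1.5, 2.0, 2.5, 3.0):
    print(r0, round(final_attack_rate(r0, coverage=0.25), 3))
```

In this sketch the modelled attack rate at 25% coverage is lower than the direct effect alone would suggest, and the predicted impact changes considerably across the assumed range of `r0`, which is the kind of variation a guideline group would want to see reported.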

Assessing the quality of a mathematical modelling study: Rapid review
We performed a rapid review of the methodological literature to identify criteria that are proposed to assess the "quality" of mathematical modelling studies (see Table S1 for the detailed search strategy). Specifically, we aimed to identify criteria proposed to assess the quality of single mathematical modelling studies, including best practice standards or criteria for assessing risk of bias or reporting quality and criteria proposed to assess the quality of a body of evidence from mathematical modelling studies. We were also interested in identifying checklists or other instruments developed to assess the quality of mathematical modelling studies.
We identified 20 relevant articles (see Figure 1 for a flow chart of the identification of eligible articles) 25,27-44 . Most gave recommendations for good modelling practice and were compiled by a task force in a consensus process or based on a systematic or narrative review of the literature. The widely cited 2003 paper by Weinstein and colleagues organized 28 recommendations under the headings "structure", "data", and "validation" 31 . A questionnaire or checklist was not included. A subsequent series of seven articles 25,38-42,44 by the joint International Society for Pharmacoeconomics and Outcomes Research (ISPOR) and Society for Medical Decision Making (SMDM) task force elaborated upon these recommendations, providing detailed advice on conceptualizing the model, state transition models, discrete event simulations, dynamic transmission models, parameter estimation and uncertainty, and transparency and validation. The 79 recommendations are summarized in the first article of the series 44 .
We identified four articles 32,34,37,43 that present comprehensive frameworks of good modelling practice, with detailed justifications of the items covered and attributes of good practice. They include signalling or helper questions to facilitate the critical appraisal of ... Caro et al. 32 to 66 questions in Bennett and Manuel 37 . The four frameworks cover similar territory, including items related to the problem concept, model structure, data sources and synthesis of the evidence, model uncertainty, consistency, transparency and validation (Table 2). Two of the frameworks include sponsorship and conflicts of interest 32,37 .
In a qualitative study, Chilcot et al. 27 performed in-depth interviews with 12 modellers from academic and commercial sectors; model credibility emerged as the central concern of decision-makers using models. Respondents agreed that developing an understanding of the clinical situation or disease process being investigated is paramount in ensuring model credibility, highlighting the importance of clinical input during the model development process 27 .

Model comparisons and modelling consortia
Published mathematical models addressing the same issue may reach contrasting conclusions. In this situation, careful comparison of the models may lead to a deeper understanding of the factors that drive outputs and conclusions. Ideally, the different modelling groups come together to explore the importance of differences in the type and structure of their models, and of the data used to parameterize them 19,45,46 . For example, several groups of modellers have investigated the impact of expanding access to antiretroviral therapy (ART) on new HIV infections. The HIV Modelling Consortium compared the predictions of several mathematical models simulating the same ART intervention programs to determine the extent to which models agree on the epidemiological impact of expanded ART 19 . The consortium concluded that although models vary substantially in structure, complexity, and parameter choices, all suggested that ART, at high levels of access and with high adherence, has the potential to substantially reduce new HIV infections in the population 19 . There was broad agreement regarding the short-term epidemiologic impact of ART scale-up, but more variation in longer-term projections and in the efficiency with which treatment can reduce new infections. The impact of ART on HIV incidence long-term is expected to be lower if models: (i) allow for heterogeneity in sexual risk behaviour; (ii) are age-structured; (iii) estimate a low proportion of HIV transmission from individuals not on ART with advanced disease (at low CD4 counts); (iv) are compared to what would be expected in the presence of HIV counselling and testing (compared to no counselling and testing); (v) assume relatively high infectiousness on ART; and (vi) consider drug resistance 19,47,48 .
Assessing mathematical modelling studies using the GRADE approach
GRADE was conceived with the intention of creating a uniform system to assess a body of evidence to support guideline development in response to a confusing array of different systems in use at that time 49 . It has since been adopted by over 90 organisations, including WHO. GRADE addresses clinical management questions, including the impact of therapies and diagnostic strategies, diagnostic accuracy questions (i.e., the accuracy of a single diagnostic or screening test), the (cost-) effectiveness and safety of public health interventions, and questions about prognosis.
The GRADE approach encompasses two main considerations: the degree of certainty in the evidence used to support a decision and the strength of the recommendation. The degree of certainty, i.e., the confidence in or quality of a body of evidence, is rated as "high", "moderate", "low", or "very low" based on an assessment of five dimensions: study limitations (risk of bias), imprecision, inconsistency, indirectness, and publication bias. The initial assessment is based on the study design: RCTs start as high certainty and observational studies as low certainty. Based on the assessments of the five dimensions, RCTs may be down-rated and observational studies up-or down-rated. Judgment is required when assessing the certainty of the evidence, taking into account the number of studies of higher and lower quality and the relative importance of the different dimensions in a given context. The second consideration is the strength of the recommendation, which can be "strong" or "conditional", for or against an intervention or test, based on the balance of benefits and harms, certainty of the evidence, the relative values of persons affected by the intervention, resource considerations, acceptability and feasibility, among others 50 .
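The rating ladder described above can be sketched in a few lines of code. This is only an illustration of the arithmetic of starting levels and up- or down-rating as summarized in this section; the function and variable names are our own assumptions, and the sketch does not replace the judgment that GRADE requires when the five dimensions are assessed.

```python
# Certainty levels in ascending order, as named in the text.
LEVELS = ["very low", "low", "moderate", "high"]

# Starting level by study design, as described in the text.
STARTING_LEVEL = {"rct": "high", "observational": "low"}


def rate_certainty(design, downgrades=0, upgrades=0):
    """Return the certainty level after up- and down-rating.

    `downgrades` and `upgrades` are hypothetical counts of levels moved,
    decided by the panel after assessing the five dimensions.
    """
    if design not in STARTING_LEVEL:
        raise ValueError(f"unknown study design: {design}")
    if design == "rct" and upgrades:
        # RCTs already start at the highest level and can only be down-rated.
        raise ValueError("RCTs cannot be up-rated")
    index = LEVELS.index(STARTING_LEVEL[design]) - downgrades + upgrades
    # Clamp to the available range of levels.
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]


# Example: RCT evidence down-rated once for imprecision.
print(rate_certainty("rct", downgrades=1))
```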
We believe that evidence from mathematical modelling studies could be assessed within the GRADE framework and included in the guideline development process. Specifically, guideline groups might include mathematical modelling studies as an additional study category alongside the categories of RCTs and observational studies currently defined in GRADE. The dimensions of indirectness, inconsistency, imprecision and publication bias are applicable to mathematical modelling studies, but criteria may need to be adapted. The concept of bias relates to results or inferences from empirical studies, including RCTs and observational studies 51,52 and is too narrow in the context of assessing mathematical modelling studies 53 . "Credibility", a term used by ISPOR 54 , may therefore be more appropriate for modelling studies than "risk of bias". The assessment of the credibility of a model is informed by a comprehensive quality framework and should cover the conceptualization of the problem, model structure, input data and their risk of bias, different dimensions of uncertainty, as well as transparency and validation (Table 2). The framework should be tailored to each set of modelling studies by adding or omitting questions and developing review-specific guidance on how to assess each criterion. The certainty of the body of evidence from modelling studies can then be classified as high, moderate, low, or very low. In the evidence-to-decision framework a distinction should be made between observed outcomes from empirical studies and modelled outcomes from modelling studies (see the Meeting Report 4 for an example).

Conclusions and recommendations
Based on the discussions and presentations at the workshop in Geneva, the survey and rapid systematic review, we believe a number of conclusions can be formulated.
When is it appropriate to consider modelling studies as part of the evidence that supports a guideline?
1. The use of modelling studies should routinely be considered in the process of developing WHO guidelines. Findings of mathematical modelling studies can provide important evidence that may be highly relevant. Evidence from modelling studies should be considered specifically in the absence of empirical data directly addressing the question of interest, when modelling based on appropriate indirect evidence may be indicated. Examples for such situations include the evaluation of long-term effectiveness, and the impact of one or several interventions (comparative effectiveness), for example in the context of public health programmes where RCTs are rarely available.
2. Modelling may be more acceptable and more influential in situations where immediate action is called for, but little direct empirical evidence is available, and may arguably be more acceptable in public health than in clinical decision making. In these situations (for example, the HIV, Ebola, or Zika epidemics) funding is also likely to become available to support dedicated modelling studies.
3. The use of evidence from mathematical models should be carefully considered and there should be a systematic and transparent approach to identifying existing models that may be relevant, and to commissioning new models.
How should the credibility of mathematical modelling studies be assessed?
4. No single "one-size-fits-all" approach is appropriate to assess the quality of modelling studies. Existing frameworks and checklists may be adapted to a set of modelling studies by adding or omitting questions. In some situations, the approach will need to be developed de novo.
5. Additional expertise will typically be required in the systematic review groups or guideline development groups to appropriately assess the credibility of modelling studies and interpret their results.
6. The credibility of the models should not be evaluated only by modellers, and not only by modellers involved in the development of these models.
How can the GRADE approach be adapted to assess a body of evidence that includes the results of modelling and to formulate recommendations?
7. The inclusion of evidence from modelling studies into the GRADE process is possible and desirable, with relatively few adaptations. GRADE is simply rating the certainty of evidence to support a decision and any type of evidence can in principle be included.
8. The certainty of the evidence for modelling studies should be assessed and presented separately in summaries of the evidence (GRADE evidence profiles), and classified as high, moderate, low, or very low certainty.
9. The GRADE dimensions of certainty (imprecision, indirectness, inconsistency and publication bias) and the criteria defined for their assessment are also relevant to modelling studies.
10. For modelling studies, the concept of the 'credibility' of the model, which takes the structure of the model, input data, dimensions of uncertainty, as well as transparency and validation into account, is more appropriate than 'study limitations' or 'risk of bias'.
11. When summarizing the evidence, a distinction should be made between observed and modelled outcomes.
12. We propose that within the GRADE system, modelling studies start at low certainty. It should then be possible to increase or decrease the certainty of modelling studies based on a set of criteria. The development of these criteria was beyond the scope of this article; a GRADE working group is addressing this issue (http://www.gradeworkinggroup.org/).
We look forward to discussing these recommendations with experts and stakeholders and to developing exact procedures and criteria for the assessment of modelling studies and their inclusion in the GRADE process.
Competing interests
Susan L. Norris is a member of the GRADE working group. No other competing interests were disclosed.

Grant information
The work reported in this article and the expert consultation meeting in Geneva, Switzerland, were funded by the UNICEF/ UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases (WHO/TDR).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors argue that evidence from modeling studies should be included in the Grading of Recommendations Assessment, Development and Evaluation (GRADE) process, and make specific recommendations to that effect. The authors argue that model credibility is more appropriate than risk of bias for evaluating strength of evidence generated by modeling studies. The paper is based on discussions and findings from a meeting of modeling experts in Geneva in 2016; the authors were also participants in the meeting.
The paper lays out a structured argument for incorporating modeling studies into the evidence base, particularly for formulating WHO recommendations related to treatment of HIV. The authors start by providing a review of how models are used in various fields, with suggestions about how they can inform guideline development. They address the question of what constitutes a modeling study. A comprehensive accounting of published literature on assessment of models is provided. Finally, they give recommendations for how models can be evaluated using the GRADE approach, with specific conclusions about important issues such as when modeling studies should be used as part of the evidence base and how their credibility should be assessed.
Given the sweeping variety of models used in published studies about HIV treatments, policies and interventions, the authors are to be applauded for putting forward a framework for having this conversation. It will promote broader understanding of how models work and how they can be most optimally used for informing treatment guidelines.
At the crux of their argument is the claim that evidence generated from models should be judged in terms of model credibility rather than on risk of bias. This argument raises several important issues. First, what constitutes a mathematical model? If a model is to be evaluated on its credibility, we need a definition to work from. Second, what kinds of output from models should be considered evidence, and how should the quality of that evidence be judged and ultimately weighed against or combined with evidence generated from randomized trials and observational cohort data?

What is a mathematical model?
According to the authors, mathematical modeling is "a mathematical framework representing variables and their interrelationships to describe observed phenomena or predict future events." On its face this is surely true, but for the purpose of understanding whether and how models should add to the evidence base, it's too broad. This definition covers a vast assortment of mathematical models, ranging from validated descriptions of natural phenomena (where the mathematical relationships are known and directly observable) to representations of progression from HIV infection to death (comprising known and unknown mathematical components, many of which cannot be directly observed).

Consider three examples for illustration:
1. The mathematical representation of radioactive decay is known and can be written down explicitly. The model enables accurate and replicable predictions of future observations.
2. The mathematical model for absorption of a specific drug is typically not known, but empirical studies have shown that it is possible to approximate the systematic variation using nonlinear equations. These models incorporate known information about physiology and properties of a specific drug, but are necessarily oversimplified representations of drug absorption because there are unobservable characteristics of individuals that affect absorption. The models can be used to make reliable predictions on average, but require unexplained variation to be reflected in terms of prediction intervals.
3. Now consider a model of the population dynamics of HIV infection and disease progression. This process also follows a mathematical model, but the model itself is highly complex. Unlike radioactive decay or rate of drug absorption, the mathematical representations of several components of the underlying processes are essentially unknown. Moreover, much of the data needed to inform the models are either unobserved (e.g. timing of HIV infection) or only sparsely observed (e.g. individual-level viral load).
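To make the radioactive decay example concrete, the standard decay law can be written out in full. The symbols are introduced here for illustration only: N(t) is the number of undecayed nuclei at time t, N_0 the initial number, \lambda the decay constant and t_{1/2} the half-life.

```latex
\[
\frac{dN}{dt} = -\lambda N(t), \qquad
N(t) = N_0 \, e^{-\lambda t}, \qquad
t_{1/2} = \frac{\ln 2}{\lambda}
\]
```

A single measurable parameter fixes every future prediction, which is what sets this case apart from the drug absorption and HIV examples.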
All of these are mathematical models, but definitions must distinguish between them. Otherwise there is an implied equivalence that lends more credibility than is deserved to models that are heavily reliant on unverified assumptions about the mathematical structure underlying the dynamic system being modeled. A more systematic classification of model types would therefore be helpful.
While the authors' definition of mathematical model is overly broad, the definition of statistical model, used to contrast with mathematical models, is too narrow. A statistical model is used to characterize sources of variation in observed data. It is based on a probabilistic representation of the data generating mechanism, which is itself a mathematical model. Theory and methods of statistical inference provide a rigorous and transparent set of techniques for parameter estimation, prediction of future outcomes, extrapolation (e.g. for causal inference), and uncertainty quantification. The last of these, uncertainty quantification, is a critical and frequently missing component of predictions based on mathematical models.
For the purposes of generating evidence for WHO recommendations, the main difference between a mathematical model and a statistical model is that mathematical models tend to have broader scope and incorporate higher dimensions of complexity, but rely more heavily on assumptions about underlying mathematical structure than on individual-level data. Statistical models tend to have less mathematical complexity and more narrow scope, and are typically fitted to a single (possibly large) set of observed individual-level data drawn from the target population(s) of interest. A mathematical structure underlies both statistical and mathematical models, and both can be used for prediction of future outcomes and for causal policy comparisons.

Should models be judged on 'risk of bias'?
The authors propose that evidence generated from mathematical models should be weighted more heavily toward model credibility than risk of bias.
Many mathematical models are over-parameterized relative to the amount of data used to fit them; hence multiple configurations of parameter values can lead to very similar predictions. Mathematical models are typically calibrated to observed population-level data (e.g. annual HIV incidence rate for the target population), but the formal rules for doing this seem to vary across applications.
For many consumers of model-based outputs, this is a significant methodologic concern that goes directly to the question of credibility. If multiple model configurations can generate similar predictions, which configuration is the most credible one? It seems reasonable that model-based outcomes such as 10-year predictions of HIV incidence need to be evaluated on their own terms. If coupled with a formal process for back-checking or recalibrating existing models this would surely add value, and would possibly strengthen model identifiability (i.e., provide evidence in favor of one set of model parameters over another).
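The identifiability concern can be illustrated with a minimal sketch. It assumes a deliberately simple exponential-growth model, which is not a model from the paper under review; the function `predicted_cases` and the parameters `beta` (transmission rate) and `gamma` (removal rate) are hypothetical names chosen for illustration. Two parameter configurations with the same difference `beta - gamma` reproduce identical trajectories, yet they disagree once an intervention that halves transmission is simulated.

```python
import math

def predicted_cases(beta, gamma, i0, years):
    """Cases under exponential growth: I(t) = I0 * exp((beta - gamma) * t)."""
    return [i0 * math.exp((beta - gamma) * t) for t in years]

years = list(range(11))  # a 10-year horizon, as in the reviewer's example

# Two configurations with the same net growth rate (beta - gamma = 0.2).
config_a = predicted_cases(beta=0.50, gamma=0.30, i0=100.0, years=years)
config_b = predicted_cases(beta=0.40, gamma=0.20, i0=100.0, years=years)

# Calibration data cannot separate them: the trajectories coincide.
max_gap = max(abs(a - b) for a, b in zip(config_a, config_b))

# A hypothetical intervention that halves transmission gives different answers.
policy_a = predicted_cases(beta=0.25, gamma=0.30, i0=100.0, years=years)
policy_b = predicted_cases(beta=0.20, gamma=0.20, i0=100.0, years=years)

print(f"largest gap without intervention: {max_gap:.2e}")
print(f"year-10 cases with intervention: A={policy_a[-1]:.1f}, B={policy_b[-1]:.1f}")
```

Under these assumptions, configuration A predicts a decline while configuration B predicts a flat epidemic, which is why additional data or post-hoc checks are needed to choose between them.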
A more general justification for incorporating risk of bias into model evaluation can be found in Coveney et al 1 (page 4), who provide a general rubric for assessing quality of scientific evidence in the age of big data, emphasizing 'acceptance of the theory based on concordance between the predictions and the measurements.' Model calibrations at the time of model fitting partially fulfill this objective, but post-hoc evaluation of model predictions must play an important role in establishing credibility.
The process of combining and comparing models is highly innovative and likely to have a positive impact on whether the results will be well received. This kind of cooperation and collaboration, exemplified recently by the Modelling Consortium, is perhaps unique to the mathematical modeling community. Evidence generated by these kinds of activities can form an important part of the evidence base.

Summary
The authors have provided a thorough case for including results from mathematical modeling into the formal evidence base used for making health recommendations, especially as they relate to HIV. The paper is based on findings from a recent conference and a comprehensive survey of extant literature.
The main critiques are that the definition of mathematical model is far too broad, and that bias (or risk of bias) needs to be incorporated into the evaluation criteria. Formal methods for uncertainty quantification are critical as well.
Mathematical models are prevalent and influential in the HIV literature; hence a discussion about whether and how to place their findings in the broader evidence base is needed and welcome. This paper provides a necessary starting point.

Are the conclusions drawn balanced and justified on the basis of the presented arguments? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Biostatistics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Authors' response: Thank you very much for reviewing our paper and making such thoughtful comments. We address your reservations one by one below. We have made changes in the text via tracked changes.
The authors propose to incorporate findings from mathematical modeling studies into the development of WHO guidelines and other processes related to evaluating and developing public health policies. They argue that evidence from modeling studies should be included in the Grading of Recommendations Assessment, Development and Evaluation (GRADE) process, and make specific recommendations to that effect. The authors argue that model credibility is more appropriate than risk of bias for evaluating strength of evidence generated by modeling studies. The paper is based on discussions and findings from a meeting of modeling experts in Geneva in 2016; the authors were also participants in the meeting.
Authors' response: First, we would like to stress that our article is an opinion piece, and does not reflect an official position of the World Health Organization or any other body. In the introduction we write: "In this paper, which reflects the opinions of the authors… ". Also, we tried to keep the recommendations fairly general, rather than specific and prescriptive. For example, we refrained from recommending a specific instrument to assess the quality of modeling studies. We conclude by saying that "We look forward to discussing these recommendations with experts and stakeholders and to developing exact procedures and criteria for the assessment of modelling studies and their inclusion in the GRADE process."
The paper lays out a structured argument for incorporating modeling studies into the evidence base, particularly for formulating WHO recommendations related to treatment of HIV. The authors start by providing a review of how models are used in various fields, with suggestions about how they can inform guideline development. They address the question of what constitutes a modeling study. A comprehensive accounting of published literature on assessment of models is provided. Finally, they give recommendations for how models can be evaluated using the GRADE approach, with specific conclusions about important issues such as when modeling studies should be used as part of the evidence base and how their credibility should be assessed.
Authors' response: Thank you, this is a nice outline of the paper.
Given the sweeping variety of models used in published studies about HIV treatments, policies and interventions, the authors are to be applauded for putting forward a framework for having this conversation. It will promote broader understanding of how models work and how they can be most optimally used for informing treatment guidelines.
Authors' response: Thank you very much.
At the crux of their argument is the claim that evidence generated from models should be judged in terms of model credibility rather than on risk of bias. This argument raises several important issues. First, what constitutes a mathematical model? If a model is to be evaluated on its credibility, we need a definition to work from. Second, what kinds of output from models should be considered evidence, and how should the quality of that evidence be judged and ultimately weighed against or combined with evidence generated from randomized trials and observational cohort data?
Authors' response: We agree that these are central issues. Regarding the outputs from models that should be considered evidence, please note that we distinguish between mathematical models and modelling studies. The latter address well defined questions and outcomes, such as the impact of HIV testing and immediate antiretroviral therapy on HIV incidence or the impact of different screening strategies on the incidence of cervical cancer. In other words, the modeling outputs that constitute relevant evidence will depend on the question addressed in the modeling studies. In the revised version we write.
GRADE provides a well-defined framework for weighing evidence from randomized trials and observational studies, as discussed in the section on "Assessing mathematical modelling studies using the GRADE approach". Of note, randomized trials and observational studies are assessed separately. In general, guideline development groups will focus on randomized evidence if such evidence is available from several trials, and only consider observational studies in the absence of substantial randomized evidence. Similarly, evidence from mathematical modelling studies will be considered primarily if other studies cannot answer the question. Statistically combining evidence from different study types is not foreseen in GRADE, and beyond the scope of our article.

What is a mathematical model?
According to the authors, mathematical modeling is "a mathematical framework representing variables and their interrelationships to describe observed phenomena or predict future events." On its face this is surely true, but for the purpose of understanding whether and how models should add to the evidence base, it's too broad. This definition covers a vast assortment of mathematical models, ranging from validated descriptions of natural phenomena (where the mathematical relationships are known and directly observable) to representations of progression from HIV infection to death (comprising known and unknown mathematical components, many of which cannot be directly observed).

Consider three examples for illustration:
1. The mathematical representation of radioactive decay is known and can be written down explicitly. The model enables accurate and replicable predictions of future observations.
2. The mathematical model for absorption of a specific drug is typically not known, but empirical studies have shown that it is possible to approximate the systematic variation using nonlinear equations. These models incorporate known information about physiology and properties of a specific drug, but are necessarily oversimplified representations of drug absorption because there are unobservable characteristics of individuals that affect absorption. The models can be used to make reliable predictions on average, but require unexplained variation to be reflected in terms of prediction intervals.
3. Now consider a model of the population dynamics of HIV infection and disease progression. This process also follows a mathematical model, but the model itself is highly complex. Unlike radioactive decay or rate of drug absorption, the mathematical representations of several components of the underlying processes are essentially unknown. Moreover, much of the data needed to inform the models are either unobserved (e.g. timing of HIV infection) or only sparsely observed (e.g. individual-level viral load).
All of these are mathematical models, but definitions must distinguish between them. Otherwise there is an implied equivalence that lends more credibility than is deserved to models that are heavily reliant on unverified assumptions about the mathematical structure underlying the dynamic system being modeled. A more systematic classification of model types would therefore be helpful.
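As a hedged illustration of the contrast between the first two examples, the equations below use standard textbook forms; the symbols and the one-compartment structure are assumptions for illustration and are not taken from the paper under review. Radioactive decay has an exact closed form, whereas drug absorption is commonly approximated by a nonlinear curve with a residual term for unexplained variation.

```latex
% Example 1: radioactive decay, known exactly.
% N_0 = initial number of nuclei, \lambda = decay constant.
N(t) = N_0 \, e^{-\lambda t}

% Example 2: one-compartment model with first-order absorption (an approximation).
% D = dose, F = bioavailability, V = volume of distribution,
% k_a = absorption rate, k_e = elimination rate (k_a \neq k_e),
% \varepsilon(t) = unexplained variation (assumed additive here).
C(t) = \frac{F D \, k_a}{V \, (k_a - k_e)} \left( e^{-k_e t} - e^{-k_a t} \right) + \varepsilon(t)
```

In the second case the parameters vary between individuals and the residual term drives the width of the prediction intervals. No comparably compact form exists for the third example.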
Authors' response: Thank you for these three examples. We agree that the definition by Pieter Eykhoff is fairly broad and covers all three categories. However, we feel it is clear from the title and text of our paper that we are primarily concerned with models of the second and third category, i.e. more complex mathematical models that are relevant to WHO guidelines. Within these categories, the level of abstraction and complexity of models and their credibility will of course vary (see also examples in Table 1).
We have now made explicit the distinction that we make between mathematical modelling studies and mathematical models at the beginning of the section, "What is a mathematical modelling study?" (page 4):

A broad definition of a mathematical model is a "mathematical framework representing variables and their interrelationships to describe observed phenomena or predict future events". We make a distinction between a mathematical model and mathematical modelling studies, which we define as studies that address defined research questions using mathematical models with a considerable degree of complexity and abstraction.
At the Geneva workshop participants discussed different types of mathematical models in detail, based on a presentation by one of the authors (CA) on "The anatomy of mathematical modelling studies". The workshop report and slides can be found at http://apps.who.int/iris/bitstream/10665/258987/1/WHO-HIS-IER-REK-2017.2-eng.pdf. Unfortunately, the link to the report and slides was incorrect in the F1000research paper. CA discussed model dichotomies, based on the book by Ben Bolker (Ecological Models and Data in R, 2008, Princeton University Press), and illustrated these using case studies from the Ebola crisis; see the copy of one of his slides at the end of this response. In the discussion, workshop participants argued that guideline groups will often not be able to differentiate between different model dichotomies, and that this is not essential: guideline groups "just have to know what information models can provide and what value can be placed in that information." However, we recommend that experts in mathematical modelling should support guideline groups (see recommendation 5 on page 9).
We agree with the referee that guideline developers should carefully assess the credibility of models, and that models that "are heavily reliant on unverified assumptions about the mathematical structure underlying the dynamic system" are not credible. Our review of the methodological literature (see Table 2 in the paper) showed that the published frameworks of good modelling practice consistently emphasize the importance of the rationale for the model structure, the structural assumptions and uncertainty, the model transparency and validation etc. See also our recommendations 4, 5 and 6 on p 9.
We added a more explicit reference and the correct link to the Workshop report, (page 3, last line):

A detailed workshop report is available from WHO.
We also expanded the section, "What is a mathematical modelling study" to clarify our view on the need for classifying mathematical modelling studies (page 4, last paragraph):

Workshop participants discussed whether it might be helpful for guideline groups to classify mathematical models in terms of their scope (for example descriptive versus predictive) or technical approach (for example static versus dynamic). Discussants argued that a good understanding of what information models can provide and what level of confidence can be placed in that information was more important than a taxonomy of models.
While the authors' definition of mathematical model is overly broad, the definition of statistical model, used to contrast with mathematical models, is too narrow. A statistical model is used to characterize sources of variation in observed data. It is based on a probabilistic representation of the data generating mechanism, which is itself a mathematical model. Theory and methods of statistical inference provide a rigorous and transparent set of techniques for parameter estimation, prediction of future outcomes, extrapolation (e.g. for causal inference), and uncertainty quantification. The last of these, uncertainty quantification, is a critical and frequently missing component of predictions based on mathematical models.
Authors' response: We agree with the reviewer's comment about assessing uncertainty in mathematical models and state this explicitly (page 5, paragraph 2).
For the purposes of generating evidence for WHO recommendations, the main difference between a mathematical model and a statistical model is that mathematical models tend to have broader scope and incorporate higher dimensions of complexity, but rely more heavily on assumptions about underlying mathematical structure than on individual-level data. Statistical models tend to have less mathematical complexity and more narrow scope, and are typically fitted to a single (possibly large) set of observed individual-level data drawn from the target population(s) of interest. A mathematical structure underlies both statistical and mathematical models, and both can be used for prediction of future outcomes and for causal policy comparisons.
Authors' response: We are grateful to the referee for this insightful and well-phrased comment about the relevance of the terms 'statistical modelling' and 'mathematical modelling' to WHO guidelines. We have taken the liberty of paraphrasing the comment to revise this section (page 4) as follows:

Should models be judged on 'risk of bias'?
The authors propose that evidence generated from mathematical models should be weighted more heavily toward model credibility than risk of bias.
Authors' response: Yes, we believe that the concept of model credibility is more useful than the narrower concept of risk of bias (RoB). However, we think there is a misunderstanding here: the assessment of RoB of specific studies also has a role.
The RoB concept is widely used in the context of randomized controlled trials and observational studies that aim to make causal inference, and dedicated "RoB tools" have been developed to assess the risk of bias of studies included in systematic reviews (see references 1,2 below and www.riskofbias.info). These tools are based on relatively few well-defined biases. In the case of randomized trials they include selection bias, performance bias, detection bias, attrition bias and reporting bias (1).
In the context of mathematical modelling studies, the risk of bias of empirical studies contributing parameter estimates is important and should be considered, for example in sensitivity analyses. On the other hand, many other and additional aspects are important when assessing the trustworthiness or credibility of mathematical models. These aspects are listed in Table 2, based on a review of published frameworks developed to assess good modelling practice. Please note that we use the term credibility as applied by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) to assessment of studies for decision making (3).
These frameworks include assessments of the quality of the data used to parameterize a model. For example, Bennett and Manuel (4) and Philips et al (5) include several questions to that effect: Where choices have been made between data sources, are these justified appropriately? Where data from different sources are pooled, is this done in a way that the uncertainty relating to their precision and possible heterogeneity is adequately reflected? Has the quality of the data been assessed appropriately? The questionnaire proposed by Caro et al (6) asks: Are the data used in populating the model suitable for your decision problem? All things considered, do you agree with the values used for the inputs?
Similarly, the framework of Ramos and colleagues (7) includes the following questions: Have transition probabilities and intervention effects been derived from representative data sources for the decision problem?
Have parameters relating to the effectiveness of interventions derived from observational studies been controlled for confounding?
We have clarified our position and the role of RoB assessments as follows on page 8:

The concept of bias relates to results or inferences from empirical studies, including randomized controlled trials and observational studies and is too narrow in the context of assessing mathematical modelling studies. "Credibility", a term used by ISPOR, may therefore be more appropriate for modelling studies than "risk of bias". The assessment of the credibility of a model is informed by a comprehensive quality framework and should cover the conceptualization of the problem, model structure, the input data and their risk of bias, different dimensions of uncertainty, as well as transparency and validation (Table 2).
Many mathematical models are over-parameterized relative to the amount of data used to fit them; hence multiple configurations of parameter values can lead to very similar predictions. Mathematical models are typically calibrated to observed population-level data (e.g. annual HIV incidence rate for the target population), but the formal rules for doing this seem to vary across applications.
Authors' response: We agree: model concept, structure and parsimony are important elements when evaluating the credibility of mathematical models. Validation and predictive validity are also very important; again, see Table 2.
For many consumers of model-based outputs, this is a significant methodologic concern that goes directly to the question of credibility. If multiple model configurations can generate similar predictions, which configuration is the most credible one? It seems reasonable that model-based outcomes such as 10-year predictions of HIV incidence need to be evaluated on their own terms. If coupled with a formal process for back-checking or recalibrating existing models this would surely add value, and would possibly strengthen model identifiability (i.e., provide evidence in favor of one set of model parameters over another).
Authors' response: We agree and, again, believe that these issues are covered by the frameworks we present in Table 2.
A more general justification for incorporating risk of bias into model evaluation can be found in Coveney et al (page 4), who provide a general rubric for assessing quality of scientific evidence in the age of big data, emphasizing 'acceptance of the theory based on concordance between the predictions and the measurements.' Model calibrations at the time of model fitting partially fulfill this objective, but post-hoc evaluation of model predictions must play an important role in establishing credibility.
Authors' response: The timely piece by Coveney and Dougherty is really a critique of "'blind' big data projects" and a plea for "the elucidation of the multiscale and stochastic processes controlling the behaviour of complex systems, including those of life, medicine and healthcare." We could not agree more and argue that insights from the latter (mathematical models) should inform the development of WHO guidelines.
The process of combining and comparing models is highly innovative and likely to have a positive impact on whether the results will be well received. This kind of cooperation and collaboration, exemplified recently by the Modelling Consortium, is perhaps unique to the mathematical modeling community. Evidence generated by these kinds of activities can form an important part of the evidence base.

Wilma A. Stolk
Erasmus MC, Department of Public Health, University Medical Center Rotterdam, Rotterdam, The Netherlands
In this opinion article, the authors discuss when and how to incorporate the results of modelling studies into WHO guidelines, by addressing three questions: (1) When is it appropriate to consider modelling studies as part of the evidence that supports a guideline? (2) How should the quality and risk of bias in mathematical modelling studies be assessed? (3) How can the GRADE approach be adapted to assess the certainty of a body of evidence that includes the results of modelling and to formulate recommendations? Based on findings from a web-based expert survey, a rapid literature review to identify criteria for assessing the "quality" of mathematical modelling studies, and on discussions and presentations at a workshop on the topic that was held April 2016 in Geneva, the authors conclude that modelling studies should indeed routinely be considered in the process of developing WHO guidelines, particularly in the evaluation of public health programmes, long-term effectiveness or comparative effectiveness. As for other types of evidence taken into consideration, there should be a systematic and transparent approach to identifying existing models that may be relevant and the quality and credibility of models should be systematically assessed. Relatively few adaptations are needed in the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach to rate the certainty of a body of evidence and to produce information that is used by guideline panels to formulate recommendations, based on the balance of benefits and harms and other considerations.

MINOR COMMENTS:
Recommendation 4 is "No single 'one-size-fits-all' approach is appropriate to assess the quality of modelling studies. Existing frameworks and checklists may be adapted to a set of modelling studies by adding or omitting questions. In some situations, the approach will need to be developed de novo." I'd prefer to turn it around: based on existing frameworks and checklists, generic criteria can be developed to assess the quality of modelling studies, although, depending on the situation, questions may have to be added or omitted. I am not convinced that in some situations a completely new approach is needed, and this would also not be advisable. The authors should either delete the last statement, or explain under which circumstances such a new approach is needed, ideally illustrated with an example.
Recommendation 8 is "The certainty of the evidence for modelling studies should be assessed and presented separately in summaries of the evidence (GRADE evidence profiles), and classified as high, moderate, low, or very low certainty." In the text, the authors state that RCTs start as high certainty and observational studies as low certainty, although this certainty score may be up- or down-rated based on detailed assessment of five dimensions. Is it possible to give an indication of where modelling studies would start, with a justification? If not, can the authors describe factors to be considered when determining the start class?
The questionnaire of the online survey on the use of mathematical modelling in guidelines for public health decision making is included as Figure S1, which combines a series of screen shots. The quality of this figure is poor and I recommend including the questionnaire as a text document.

Is the topic of the opinion article discussed accurately in the context of the current literature? Yes
Are all factual statements correct and adequately supported by citations? Yes

Recommendation 4 is "No single 'one-size-fits-all' approach is appropriate to assess the quality of modelling studies. Existing frameworks and checklists may be adapted to a set of modelling studies by adding or omitting questions. In some situations, the approach will need to be developed de novo." I'd prefer to turn it around: based on existing frameworks and checklists, generic criteria can be developed to assess the quality of modelling studies, although, depending on the situation, questions may have to be added or omitted. I am not convinced that in some situations a completely new approach is needed, and this would also not be advisable. The authors should either delete the last statement, or explain under which circumstances such a new approach is needed, ideally illustrated with an example.
Authors' response: Thank you. We agree and have deleted the last statement on page 9.
Recommendation 8 is "The certainty of the evidence for modelling studies should be assessed and presented separately in summaries of the evidence (GRADE evidence profiles), and classified as high, moderate, low, or very low certainty." In the text, the authors state that RCTs start as high certainty and observational studies as low certainty, although this certainty score may be up- or down-rated based on detailed assessment of five dimensions. Is it possible to give an indication of where modelling studies would start, with a justification? If not, can the authors describe factors to be considered when determining the start class?
Authors' response: Thank you, we have addressed this issue as follows on page 9: "We propose that within the GRADE system, modelling studies start at low certainty, and it is then possible to increase or decrease the certainty of modelling studies based on a set of criteria. The development of these criteria was beyond the scope of this article; a GRADE working group is addressing this issue (http://www.gradeworkinggroup.org/)."
The questionnaire of the online survey on the use of mathematical modelling in guidelines for public health decision making is included as Figure S1, which combines a series of screen shots. The quality of this figure is poor and I recommend including the questionnaire as a text document.
Authors' response: Thank you. There is no text document for the survey but we have enlarged the screen shots to increase their readability (pages 22-27).
Competing Interests: No competing interests were disclosed.