The hierarchy of evidence: Levels and grades of recommendation

Evidence-based medicine requires the integration of clinical judgment, recommendations from the best available evidence and the patient’s values. The term “best available evidence” is used quite frequently, and in order to fully understand this one needs to have a clear knowledge of the hierarchy of evidence and how the integration of this evidence can be used to formulate a grade of recommendation. It is necessary to place the available literature into a hierarchy as this allows for clearer communication when discussing studies, both in day-to-day activities such as teaching rounds or discussions with colleagues, but especially when conducting a systematic review so as to establish a recommendation for practice. This necessarily requires an understanding of both study design and quality as well as other aspects which can make placing the study within the hierarchy difficult. Another confounder is that there are a number of systems that can be used to place a study into a hierarchy and, depending on the system, a study can be placed at a different “level”. However, in general the different systems rate high quality evidence as “1” or “high” and low quality evidence as “4 or 5” or “low”. Recently, some orthopedic journals have adopted the reporting of levels of evidence with the individual studies and in many cases the grading system has been adopted from the Oxford Centre for Evidence-Based Medicine system. Rather than refer to any particular system we will speak in general terms of those studies deemed to be high-level evidence and relate this to those of lesser quality.

Some studies are better suited than others to answer a question of therapy, for example, and may more accurately represent the “truth”. The ability of a study to do this rests on two main contributing factors, the study design and the study quality. In this context we will focus for the most part on those studies addressing therapy as this is generally the most common study in the orthopedic surgical literature.

Available therapeutic literature can be broadly categorized as those studies of an observational nature and those studies that have a randomized experimental design. The reason that studies are placed into a hierarchy is that those at the top are considered the “best evidence”. In the case of therapeutic trials this is the randomized controlled trial (RCT) and meta-analyses of RCTs. RCTs have within them, by the nature of randomization, an ability to help control bias. Bias (of which there are many types) can confound the outcome of a study such that the study may over or underestimate what the true treatment effect is. Randomization is able to achieve this by not only controlling for known prognostic variables but also, and more importantly, controlling for the unknown prognostic variables within a sample population. The act of randomization should be able to create an equal distribution of prognostic variables (both known and unknown) within both the control and treatment groups within a study. This bias-controlling measure helps attain a more accurate estimation of the truth. Those studies of a more observational nature have within their designs areas of bias not present in the randomized trial.
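As a rough illustration of the balancing effect of randomization described above, the following Python sketch allocates simulated patients by coin flip and compares how often a hidden prognostic factor appears in each arm. The function name, the 30% smoker rate, the sample size and the seed are illustrative assumptions, not values taken from any study cited here.

```python
import random


def simulate_allocation(n_patients=2000, smoker_rate=0.3, seed=42):
    """Randomly allocate patients and return the smoker proportion per arm.

    Smoking status stands in for a prognostic variable that the
    investigators never measured; the allocation step does not look at it.
    All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    arms = {"treatment": [], "control": []}
    for _ in range(n_patients):
        is_smoker = rng.random() < smoker_rate      # hidden prognostic factor
        arm = rng.choice(["treatment", "control"])  # coin-flip allocation
        arms[arm].append(is_smoker)
    # Proportion of smokers in each arm (True counts as 1, False as 0).
    return {arm: sum(flags) / len(flags) for arm, flags in arms.items()}


proportions = simulate_allocation()
print(proportions)
```

With a reasonably large sample the two proportions should come out close to each other even though the allocation never used smoking status. With small samples chance imbalances remain possible, so the sketch shows a tendency rather than a guarantee.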
Meta-analyses of randomized controlled trials in effect use the data from individual RCTs and statistically pool it. [5,9] This effectively increases the number of patients that the data was obtained from, thereby increasing the effective sample size. The major drawback to this pooling is that it is dependent on the quality of the RCTs that were used. [9] For example, if three RCTs are in favor of a treatment and two are not, or if the results show wide variation between the estimates of treatment effect with large confidence intervals (i.e., the precision of the point estimate of the treatment effect is poor) between different RCTs, then there is some variable (or variables) causing inconsistent results between studies (one variable may in fact be differences in study quality, among others) and the quality of usable results from statistical pooling will be poor. However, if five methodologically well done RCTs are used, all of which favor a treatment and have precise measures of treatment effect (i.e., narrow confidence intervals), then the data obtained from statistical pooling is much more believable. The former studies can be termed heterogeneous and the latter homogeneous. [9]

Petrisor BA, et al.: The hierarchy of evidence

In contrast to this, the lowest level on the hierarchy (aside from expert opinion) is the case report and case series. [3] These are usually retrospective in nature and have no comparison group. They are able to provide outcomes for only one subgroup of the population (those with the intervention). There is the potential for the introduction of bias, especially if there is incomplete data collection or follow-up, which may happen with retrospective study designs. Also, these studies are usually based on a single surgeon's or center's experience, which may raise doubts as to the generalizability of the results. Even with these drawbacks, this study design may be useful in many ways. They can be used effectively for hypothesis generation as well as potentially providing information on rare disease entities or complications that may be associated with certain procedures or implants. Examples include reporting of infection rates following a large series of tibial fractures treated with a reamed intramedullary nail [10] or the rate of hardware failure of a particular implant, to name a few.

The next level of study is the case-control. The case-control study starts with a group who has had an outcome of interest and looks back at other similar individuals to see what factors may have been present in the study group and may be associated with the outcome. Let us take a hypothetical example: those patients who have a nonunion following a tibial shaft fracture treated with an intramedullary nail. If one wanted to see what prognostic factors may have contributed to this, a group that was matched for the known prognostic variables such as age, treatment type, fracture pattern etc. could then be compared and an analysis of other prognostic variables such as smoking, nonsteroidal anti-inflammatory use or fracture pattern could be done to see if there was any association between these and the development of nonunion. The drawback to this design is that there may be unknown or as yet unidentified risk factors that would not be able to be analyzed. However, in those that are known, the strength of association may be determined and given in the form of odds ratios or sometimes relative risks. Other strengths of this study design are that they are usually less expensive to implement and can allow for a quicker "answer" to a specific question. They also can allow for analysis of multiple prognostic factors and relationships within these factors to help determine potential associations to the outcome of choice (in this case nonunion).

In contrast to the case-control and slightly higher on the levels of evidence hierarchy, [3] the cohort study is usually done in a prospective fashion (although it can be done retrospectively) and usually follows two groups of patients. One of these groups has a risk factor or prognostic factor of interest and the other does not. The groups are followed to see what the rate of development of a disease or specific outcome is in those with the risk factor as compared to those without. Given that this is usually done prospectively, it falls higher within the hierarchy as data collection and follow-up can be more closely monitored and attempts can be made to make them as complete and accurate as possible. This type of study design can be very powerful in some instances. For example, if one wanted to see what the effect of smoking was on nonunion rates, it wouldn't be ethical or generally possible to randomize patients with fractures into those who are going to smoke and those who are not. However, by following two groups of patients, smokers and non-smokers with tibial fractures for instance, one can then document nonunion rates between the two groups. In this case, because of its prospective design, groups can at least be matched to try and limit the bias of at least those prognostic variables that are known, such as age, fracture pattern or treatment type to name a few.

It is important to understand distinctions between study designs. Some investigators argue that well-constructed observational studies lead to similar conclusions as RCTs. [11] However, others suggest that observational studies have a more significant potential to over or underestimate treatment effects. Indeed, examples are present in both medical and orthopedic surgical specialties showing that discrepant results can be found between randomized and nonrandomized trials. [6,8,12] One recent nonsurgical example of this is hormone replacement therapy in postmenopausal women. [13,14] Previous observational studies suggested that there was a significant effect of hormone replacement therapy on bone density with a favorable risk profile. However, a recent large RCT found an increasing incidence of detrimental cardiac and other adverse events in those undergoing hormone replacement therapy, risks which had heretofore been underestimated by observational studies. [13,14] As a result of this the management of postmenopausal osteoporosis has undergone a shift in first-line therapy. [13] In the orthopedic literature it has been suggested that when assessing randomized and nonrandomized trials using studies of arthroplasty vs. internal fixation, nonrandomized studies overestimated the risk of mortality following arthroplasty and underestimated the risk of revision surgery with arthroplasty. [8] Interestingly, they also found that in those nonrandomized studies that had similar results to randomized studies, patient age, gender and fracture displacement were controlled for between groups. [8] This illustrates the importance of both controlling for variables and for randomization which will control for potentially important but as yet unknown variables.

Thus the type of study design used places the study broadly into a hierarchy of evidence from the case series up to the randomized controlled trial. There is also, however, an internal hierarchy within the overall levels of evidence and that is usually based on the study methodology and overall quality.

Concepts of study methodology are important to consider when placing a study into the levels of evidence. There are some that advocate dividing the hierarchy levels into sub-levels based in part on study methodology, while others suggest that poor methodology will take a study down a level. [2,3] For instance, one RCT could be considered a very high-level study while another RCT, because of methodological limitations, may be considered lower. Do these then fall into separate categories or into sub-categories of the same level? It depends on the level of evidence system being used. The real point is that these systems acknowledge a difference in the quality and thus the "level" of these studies. In many instances however, the methodological limitations that will take a study down a level are not clearly defined and it is left to the individual to attempt to correctly categorize the study based on them.

The rigor with which a study is conducted plays a role in how believable the results may be. [15,16] Not all case-control, cohort or randomized studies are done to the same standards and thus, if done multiple times, may have different results, both due to chance or due to confounding variables and biases. Briefly, if we take the example of an RCT, one needs to look at all aspects of the methodology to see how rigorously the study was conducted. We present three examples of how different aspects of methodology may affect the results of a trial. While it is important to look closely at the methods section of a paper to see how the study was conducted, it must be remembered that if something has not been reported as being done (such as the method of randomization) it does not necessarily mean it was not done. [17] This illustrates the importance of tools such as the Consolidated Standards of Reporting Trials (CONSORT) statement, which attempts to improve the quality of reporting. [18,19]

Randomization
As randomization is the key to balancing prognostic variables, it is first necessary to determine how it was done. The most important concepts of randomization are that allocation is concealed and that the allocation is truly random. If it is known to which group a patient will be randomized it may be possible to potentially influence their allocation. Examples of this would include randomizing by chart number, birthdates or odd or even days. This necessarily introduces a selection bias which negates the effect of randomization. This makes concealment of allocation a vital component of successful randomization. Allocation can be concealed by having offsite randomization centers, web-based or phone-based randomization.

Blinding
In surgical trials blinding is obviously not possible for some aspects of the trial. It is not possible (or ethical) to blind a surgeon, nor is it usually possible to blind a patient to a particular treatment. However, there are other aspects of a trial where blinding can play a role. For instance, it is possible to blind outcome assessors, the data analysts and potentially other outcomes' adjudicators. Thus it is important to understand who is doing the data collecting and ask, are they independent and were they blinded to the treatment received? If not, possible influences (either subconscious or not) on the patient and subsequent results can happen.

Follow-up
The number lost to follow-up is very important to know as clearly this can affect the estimate of treatment effect. While some argue that only a 0% loss to follow-up fully ensures benefits of randomization, [20] in general, the validity of a study may be threatened if more than 20% of patients are lost to follow-up. [5] Calculations of results should include a worst case scenario, that is, those lost to follow-up in the treatment group are considered to have the worst outcome and those lost to follow-up in the control group are considered to have the best outcome. If there is still a treatment effect seen between the groups then this makes a more compelling argument for the treatment effect observed being a valid estimate of the truth. [21]

Scales have been devised that can rate a study based on its methodology and assign a score. [22] This does not always need to be done in daily practice however. Knowledge of the different areas of methodology though may affect interpretation of the results and allow for the recognition of a "strong" study which may then provide more compelling and "believable" results as compared to a "weaker" study.
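The worst-case calculation described for loss to follow-up can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name and all patient counts are hypothetical, the outcome is treated as a binary adverse event, and the effect measure (risk difference, control minus treatment) is one choice among several.

```python
def worst_case_risk_difference(treat_events, treat_followed, treat_lost,
                               ctrl_events, ctrl_followed, ctrl_lost):
    """Return (observed, worst_case) risk differences, control minus treatment.

    The event is an adverse outcome, so a positive difference favors treatment.
    Worst case: every lost treatment patient is assumed to have had the event,
    and every lost control patient is assumed not to have had it.
    """
    observed = ctrl_events / ctrl_followed - treat_events / treat_followed
    worst_treat = (treat_events + treat_lost) / (treat_followed + treat_lost)
    worst_ctrl = ctrl_events / (ctrl_followed + ctrl_lost)
    return observed, worst_ctrl - worst_treat


# Hypothetical trial: 100 patients randomized per arm, 10 lost in each arm.
observed, worst = worst_case_risk_difference(
    treat_events=9, treat_followed=90, treat_lost=10,
    ctrl_events=27, ctrl_followed=90, ctrl_lost=10,
)
print(f"observed: {observed:.2f}, worst case: {worst:.2f}")
# prints "observed: 0.20, worst case: 0.08"
```

In this made-up example the difference shrinks but stays positive under the worst-case assumption, which is the situation the text describes as a more compelling argument for the observed effect. Had the sign flipped, the loss to follow-up alone could account for the apparent benefit.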

Grades of recommendation
When does assessing the quality of a study in relation to the levels of evidence truly matter? It matters when a grade of recommendation is being developed. A very important concept is that a single high-level therapeutic study (in our case) does not imply a high grade of recommendation for treatment. A grade of recommendation can only be developed after a thorough systematic review of the literature and in many cases discussions with content experts. [2,4,23] When developing grades of recommendation, it becomes important to place weights on studies, with more weight being given to studies of high quality and high on the hierarchy and less so to lesser quality studies. [2]

The GRADE working group suggests a system for grading the quality of evidence obtained from a thorough systematic review [Table 1]. This should be done for all the outcomes of interest as well as all the potential harms. They suggest that once the total evidence has been graded then recommendations for treatment can be made.

The GRADE working group suggests that when making a recommendation for treatment four areas should be considered: 1) What are the benefits vs. harms? Are there clear benefits to an intervention or are there more harms than good? 2) The quality of the evidence, 3) Are there modifying factors affecting the clinical setting such as the proximity of qualified persons able to carry out the intervention? and 4) What is the baseline risk for the [...] categories. Either "do it" or "don't do it" and "probably do it" or "probably don't do it". The grades of "do it" or "don't do it" are defined as "a judgment that most well-informed people would make". The grades of "probably do it" or "probably don't do it" are defined as "a judgment that the majority of well-informed people would make but a substantial minority would not". [2]

Thus one can see that a grade of recommendation in contradistinction to a level of recommendation is made based on the above four criteria. Inherent in the above criteria are a thorough review of the literature and a grading of the studies through knowledge of study design and methodology. Evidence-based medicine is touted as being a decision-making based on the composite of the triumvirate of clinical experience, the best available evidence and patient values. One can see that knowledge of the levels of evidence, the pros and cons of different study designs and how study methodology can affect the placement of a study within the hierarchy encompasses one aspect of this. The development of grades of recommendation based on the GRADE working group system gives one the tools to convey the best available evidence to the patient as well as help the literature guide the busy clinician. Also, different harms and benefits of various treatments are given different value judgments by individual patients. Discussions with patients about what is important to them, mixed with surgical experience and "what works in my hands", help round out the decision-making when developing a treatment plan.