Outcome measurements in orthopedic

The choice of outcome measure in orthopedic clinical research studies is paramount. The primary outcome measure for a study has several implications for the design and conduct of the study. These include: 1) sample size determination, 2) internal validity, 3) compliance and 4) cost. A thorough knowledge of outcome measures in orthopedic research is therefore essential to the conduct of a quality study. The decision to choose a continuous versus dichotomous outcome has important implications for sample size. However, regardless of the type of outcome, investigators should always use the most ‘patient-important’ outcome and limit bias in its determination.


TYPES OF OUTCOME MEASURES
Investigators have a variety of options when considering outcomes for their studies. Regardless of the specific outcome measure used, outcomes should be "patient-important" and as objective as possible. Mortality is one example of an important and objective outcome measure. However, the majority of orthopedic research focuses upon the return to function or measures other than death. Thus, investigators should be familiar with instruments that measure patient function or quality of life. Jackowski and Guyatt 1 have summarized the key issues in the use of such measures [Table 1]. One of the choices that investigators face when trying to identify an appropriate measure is whether to use generic or disease-specific instruments to measure health status.
A generic instrument is one that measures general health status inclusive of physical symptoms, function and emotional dimensions of health. An example of a generic instrument is the Short Form-36. A disadvantage of generic instruments, however, is that they may not be sensitive enough to detect small but important changes. Disease-specific measures are tailored to inquire about the specific physical, mental and social aspects of health affected by a disease (e.g. arthritis). An example of a disease-specific instrument is the Western Ontario and McMaster Osteoarthritis Index.
The most commonly used generic instrument in the orthopedic surgical literature is the Short Form-36 (SF-36). The SF-36 is a multi-purpose, short-form health survey consisting of 36 questions. 2,3 The SF-36 has proven useful in surveys of general and specific populations, comparing the relative burden of diseases and in differentiating the health benefits produced by a wide range of different treatments. 2,3 The experience to date with the SF-36 has been documented in nearly 4,000 publications; citations for those published in 1988 through 2000 are documented in a bibliography covering the SF-36 and other instruments in the "SF" family of tools. 2,3
The SF-36 contains multi-function item scales to measure eight domains: physical function (10 items); role physical (four items); bodily pain (two items); general health (five items); vitality (four items); social functioning (two items); role emotional (three items); and mental health (five items). The two summary measures of the SF-36 are the physical component summary and the mental component summary. The scores for the multi-function item scales and the summary measures of the SF-36 vary from zero to 100, with 100 being the best possible score and zero being the lowest possible score. The SF-36 takes less than 15 min to complete. It can be self-administered or interview-administered. The SF-36 is available in a number of languages.
To use the SF-36, permission must be obtained through Quality Metric (www.SF-36.org).
Utility or performance measures are a unique form of generic instrument that measure health status by quantifying wellness on a continuum anchored by death and optimum health. Assessment of health utility is rooted in decision theory, which models the decision-making [...] adjusted life-years (QALYs) gained.

Table 2: Guidelines for interpreting a study using HRQOL
Has the instrument demonstrated reliability?
• Has the instrument been shown to be reliable over repeated administrations (test re-test) to a stable population, similar in characteristics and disease severity to that of the current study?
• If more than one rater was involved, was inter-rater reliability established?

LIMITING BIAS IN OUTCOMES EVALUATION
Bias in the measurement of outcomes can be minimized [...] in their determination, independent adjudication by one or more persons is an excellent way to limit bias.

Outcome measurement and sample size
This section focuses on the choice of an outcome measure and sample size. The statistical power of a study is the probability that it will find a difference between two treatments when one actually exists. By convention, investigators set the acceptable study power at 80% (i.e. a 20% chance of a false-negative result). Small studies are at risk of being underpowered (study power <80%). Surgeons must endeavor to optimize the study power when they anticipate a small sample size for their studies. The choice of the main outcome variables may play a crucial role in such circumstances.
Bhandari et al evaluated the impact of the choice of outcome variable on the statistical power in trials of orthopedic trauma. 4 They hypothesized that small studies with continuous outcome variables (time to fracture union) would achieve higher estimates of study power than those that reported dichotomous outcome variables (% union rates). In a review of 196 RCTs published in 32 medical journals Bhandari et al identified a total of 19,942 patients. Study sample sizes ranged from 10 to 662 patients. The vast majority of the studies were conducted at only one center (99.0% or 194/196) and focused upon interventions related to fracture repair (99.0% or 194/196). Fractures of the hip were the primary focus of over one-third of the included studies (34.2% or 67/196). These authors identified 76 studies (39%) with sample sizes of 50 patients or less. Two groups were formed: 29 studies reported continuous outcomes and 47 studies reported dichotomous outcomes. The mean sample size of the studies in each group was similar (P>0.05). Those studies that reported continuous outcomes had a significantly greater study power than those studies that reported dichotomous outcomes (P=0.042). Twice as many studies that reported continuous outcomes achieved conventionally acceptable study power (80% or more) than those that reported dichotomous outcomes (37% vs. 18.6%, respectively, P=0.04) [Figure 1]. Continuous variables are significantly better suited to improving statistical power in small trials than dichotomous variables.
The power of a statistical test is typically a function of the magnitude of the treatment effect, the designated Type I error rate (α, risk of false-positive result) and the sample size (n). When designing a trial, investigators can decide upon the desired study power (typically 80%) and calculate the necessary sample size to achieve this goal. If investigators are conducting a post-hoc power analysis after the completion of the study, they use the actual sample size to calculate the study's power.
Moher and colleagues identified 383 randomized trials published in the top medical journals JAMA, New England Journal of Medicine and The Lancet. Although Moher et al did not compare the statistical power and the type of outcome variable, they evaluated 70 trials with negative results and found that 68% lacked acceptable statistical power (80%). 5 Lochner and colleagues identified 117 randomized trials in orthopedics with a negative result (nonsignificant result) and reported that over 90% lacked sufficient statistical power to make definitive conclusions. 6 Of the small randomized trials in this review, we identified 78% that were underpowered.
In conclusion, the prevalence of published studies that fail to meet acceptable standards of statistical power is widespread. Surgeons can limit this problem by carefully selecting the outcome variable to optimize the study power and obviate the need for large samples of patients.
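The dependence of power on treatment effect, α and sample size can be illustrated with a short numerical sketch. It assumes a two-sided comparison of two group means under the usual normal approximation; the function name `approx_power` and the effect size and standard deviation used in the loop are hypothetical illustration values, not data from any of the cited studies.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power for comparing two means (normal approximation).

    n_per_group: patients in each arm
    delta:       true difference between the group means
    sigma:       common standard deviation of the outcome
    alpha:       two-sided Type I error rate
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # z-score that corresponds to the achieved power
    z_beta = sqrt(n_per_group * delta**2 / (2 * sigma**2)) - z_alpha
    return NormalDist().cdf(z_beta)


# Hypothetical effect: half a standard deviation between groups.
for n in (10, 25, 50, 100):
    print(f"n per group = {n:3d}  approximate power = {approx_power(n, 0.5, 1.0):.2f}")
```

With these assumed inputs, the printed values show power rising as the sample size grows, which is why small trials are the ones most at risk of being underpowered.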

SAMPLE SIZE CALCULATION
Even at best, a sample size calculation is based upon the best available "guesstimate" of the difference between treatment groups.

$= 2(12^2)(1.96 + 0.84)^2 / 5^2 = 90$.
From the equation above, our proposed study will require 90 patients per treatment arm to have adequate study power.
Reworking the above equation, the study power can be calculated for any given sample size by transforming the above formula and calculating the z-score:
$z_{1-\beta} = \left( n_1 \Delta^2 / (2\sigma^2) \right)^{1/2} - z_{1-\alpha/2}$
The actual study power that corresponds to the calculated z-score can be looked up in readily available statistical tables or on the internet (keyword: "z-table"). From the above example the z-score will be $0.84 = \left( 90(5^2) / (2(12^2)) \right)^{1/2} - 1.96$ for a sample size of 90 patients. The corresponding study power for a z-score of 0.84 is 80%.

Comparing binomial proportions (percentages for dichotomous variables)
Let's now assume that we wish to change our outcome measure to differences in secondary surgical procedures between operatively and nonoperatively treated ankle fractures. We consider a clinically important difference to be 5%. Based upon the previous literature, we estimate that the secondary surgical rates in operatively and nonoperatively treated ankles will be 5% and 10%, respectively. The number of patients required for our study can now be calculated as follows:
$n_1 = n_2 = \left[ (2 p_m q_m)^{1/2} z_{1-\alpha/2} + (p_1 q_1 + p_2 q_2)^{1/2} z_{1-\beta} \right]^2 / \Delta^2$
where
$n_1$ = sample size of Group one
$n_2$ = sample size of Group two
$p_1, p_2$ = sample probabilities (5% and 10%)
$q_1, q_2$ = $1 - p_1$, $1 - p_2$ (95% and 90%)
$p_m$ = $(p_1 + p_2)/2$ (7.5%)
$q_m$ = $1 - p_m$ (92.5%)
$\Delta$ = difference = $p_2 - p_1$ (5%)
Now we can use the sample size calculation for the proportions above to calculate the number of patients required for our study.

Using confidence intervals for sample size calculation
It can also be useful to calculate the precision of a study based on the above sample size calculation. Precision is defined as the width of the 95% confidence interval (CI). Being 95% confident means that if we repeat the study an unlimited number of times, the true difference between groups will be included in the CI in 95% of the samples. For any power and clinically relevant or hypothesized difference ($\Delta$) the predicted confidence interval can be calculated using this formula:
Predicted 95% CI = observed difference $\pm\ 0.7\,\Delta_{0.80}$
Predicted Precision = $2 \times 0.7\,\Delta_{0.80} = 1.4\,\Delta_{0.80}$
where $\Delta_{0.80}$ = true difference for which there is 80% power.
Often, choosing an expected difference between two groups can be arbitrary. An alternative method to determine an expected difference can be derived from using 95% confidence intervals. For example, rather than hypothesizing a 5% difference between operative and nonoperative treatment of ankle fractures we might be more comfortable stating that we will not accept a confidence interval for an observed difference that is wider than 7%. Thus we can work backwards from our predicted confidence interval to calculate the expected difference between groups:
$0.07 = 1.4\,\Delta_{0.80}$, so that $\Delta_{0.80} = 0.05$.
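The calculations in this section can be checked numerically. The following is a minimal sketch, not the authors' own code: the function names are hypothetical, the continuous-outcome formula is inferred from the worked numbers in the text (σ = 12, ∆ = 5), and exact normal quantiles are used instead of the rounded values 1.96 and 0.84, so results may differ slightly from hand calculations.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def n_per_group_means(delta, sigma, alpha=0.05, power=0.80):
    """Patients per arm for a continuous outcome: 2*sigma^2*(z_a + z_b)^2 / delta^2."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return 2 * sigma**2 * (z_a + z_b) ** 2 / delta**2


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Patients per arm for a dichotomous outcome (formula with p_m and q_m)."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    p_m = (p1 + p2) / 2
    q1, q2, q_m = 1 - p1, 1 - p2, 1 - p_m
    numerator = (sqrt(2 * p_m * q_m) * z_a + sqrt(p1 * q1 + p2 * q2) * z_b) ** 2
    return numerator / (p2 - p1) ** 2


def delta_from_ci_width(ci_width, alpha=0.05, power=0.80):
    """Difference detectable with the given power when the CI has the given full width.

    Inverts: precision = 2 * (z_a / (z_a + z_b)) * delta, i.e. about 1.4 * delta.
    """
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return ci_width * (z_a + z_b) / (2 * z_a)


# Values taken from the ankle fracture examples in the text.
print(f"continuous outcome:  {n_per_group_means(delta=5, sigma=12):.1f} per arm")
print(f"dichotomous outcome: {n_per_group_proportions(0.05, 0.10):.1f} per arm")
print(f"difference for a 7% wide CI: {delta_from_ci_width(0.07):.3f}")
```

Under these assumptions, the dichotomous outcome requires several times as many patients per arm as the continuous outcome in the same clinical setting, in line with the article's argument about outcome choice and power.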
IJO - January - March 2007 / Volume 41 / Issue 1
Calculating the precision illustrates the trade-off between the magnitude of the hypothesized or clinically relevant difference used in the sample size calculation and the likelihood of finding a statistically significant difference. Choosing a larger hypothesized difference decreases the required number of study subjects, but it also widens the predicted 95% confidence interval, which is then more likely to include 0 and therefore to yield statistically nonsignificant results. While it is tempting to "hypothesize" a larger difference in the primary outcome parameter in order to decrease the required sample size, it is therefore advisable to choose a realistic difference when calculating the required sample size. Also, the benefit of calculating the predicted precision is that it may be easier to understand for a nonstatistician that the primary outcome parameter