Data transformation

The following brief overview of data transformation is compiled from Howell (2007, pp. 318-324) and Tabachnick and Fidell (2007, pp. 86-89). See the references at the end of this handout for a more complete discussion of data transformation. Most people find it difficult to accept the idea of transforming data. Tukey (1977) probably had the right idea when he called data transformation calculations "reexpressions" rather than "transformations." A researcher is merely reexpressing what the data have to say in other terms. However, it is important to recognize that conclusions drawn from transformed data do not always transfer neatly to the original measurements. Grissom (2000) reports that the means of transformed variables can occasionally reverse the direction of the difference between the means of the original variables. This is disturbing, and it is important to think about the meaning of what you are doing, but it is not, in itself, a reason to rule out transformations as a viable option.
If you are willing to accept that it is permissible to transform one set of measures into another, then many possibilities become available for modifying the data to fit more closely the underlying assumptions of statistical tests. An added benefit of most transformations is that when we transform the data to meet one assumption, we often come closer to meeting other assumptions as well. For example, a square-root transformation may help equate group variances, and because it compresses the upper end of a distribution more than it compresses the lower end, it may also make positively skewed distributions more nearly normal in shape. If you decide to transform, it is important to check that the variable is normally or nearly normally distributed after transformation. That is, make sure it worked.
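The two effects of the square-root transformation described above can be illustrated with a minimal sketch. The data are invented for illustration and are not taken from Howell or from Tabachnick and Fidell; the two small groups are deliberately built from perfect squares so that the variance effect is easy to see. The helper name skewness is hypothetical, and it uses the simple moment-based definition (third central moment divided by the 1.5 power of the second).

```python
import math
import statistics


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(x - mean) ** 2 for x in values])
    m3 = statistics.fmean([(x - mean) ** 3 for x in values])
    return m3 / m2 ** 1.5


# Invented, positively skewed scores (long right tail).
scores = [1, 1, 2, 2, 3, 4, 4, 6, 9, 16, 25, 36]
sqrt_scores = [math.sqrt(x) for x in scores]

skew_raw = skewness(scores)
skew_sqrt = skewness(sqrt_scores)
print(f"skewness before: {skew_raw:.2f}, after sqrt: {skew_sqrt:.2f}")

# Two invented groups whose spread grows with their mean.
low = [1, 4, 9]
high = [16, 25, 36]

ratio_raw = statistics.variance(high) / statistics.variance(low)
ratio_sqrt = (statistics.variance([math.sqrt(x) for x in high])
              / statistics.variance([math.sqrt(x) for x in low]))
print(f"variance ratio before: {ratio_raw:.2f}, after sqrt: {ratio_sqrt:.2f}")
```

With these particular numbers the skewness falls but does not reach zero, which is why the check after transforming matters: the transformation helps here, but it does not guarantee normality.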
When it comes to reporting our data, it is legitimate and proper to run a statistical test, such as the one-way analysis of variance, on the transformed values, but we often report means in the units of the untransformed scale. This is especially true when the original units are intrinsically meaningful. Howell (2007) urges researchers to look at both the converted (transformed) and unconverted (original) means and make sure that they are telling the same basic story. Do not convert standard deviations; you will do serious injustice if you try that. And be sure to indicate to your …
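One way to follow Howell's advice is to compute the means on both scales and compare the direction of the group difference. The sketch below uses contrived data chosen so that the square-root transformation reverses the ordering of the two group means, the kind of reversal Grissom (2000) describes. The function name summarize and the group values are assumptions made for illustration only.

```python
import math
import statistics

# Contrived groups: one extreme score in group A drives its raw mean.
group_a = [1, 1, 1, 49]
group_b = [9, 9, 9, 9]


def summarize(values):
    """Return (raw mean, mean of square roots, back-transformed mean)."""
    raw_mean = statistics.fmean(values)
    sqrt_mean = statistics.fmean([math.sqrt(x) for x in values])
    back_transformed = sqrt_mean ** 2  # mean expressed in original units
    return raw_mean, sqrt_mean, back_transformed


a_raw, a_sqrt, a_back = summarize(group_a)
b_raw, b_sqrt, b_back = summarize(group_b)

# Do the two scales agree about which group is higher?
same_story = (a_raw > b_raw) == (a_sqrt > b_sqrt)

print(f"raw means:              A={a_raw}, B={b_raw}")
print(f"sqrt-scale means:       A={a_sqrt}, B={b_sqrt}")
print(f"back-transformed means: A={a_back}, B={b_back}")
print("same story on both scales:", same_story)
```

Here group A has the larger raw mean (13 versus 9) but the smaller mean on the square-root scale (2.5 versus 3.0), so the two scales do not tell the same story. Note also that the back-transformed mean of group A (6.25) is not its raw mean; it is the mean of the transformed values expressed in the original units.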

Sir, I read with interest the article by Manikanandan S, "Data transformation." [1] Articles of this kind are very helpful for postgraduates and young researchers. As the article will guide postgraduates in analyzing their research data, I would like to comment on a few issues raised in it:
1. The author mentioned that the normal distribution is so called because most biological variables (weight, height, and blood sugar) follow it.
Here I want to emphasize that the normal distribution was not discovered for the distribution of biological variables, and it is so called not because most biological variables follow it but because it is the most frequently seen distribution in nature. The statement, "Most biological variables follow the normal distribution," needs to be treated cautiously, as opinions are divided among statisticians and researchers. If we were sure that biological variables like blood pressure, height, and weight always follow the normal distribution, there would be no need to check the distribution of these variables and parametric tests could be used straightaway; in practice, however, the distribution is checked for these variables too, and nonparametric tests are used when it is found not to be normal. The distribution of a variable in a sample also depends on the sample size; when the sample size is small, the distribution is more likely to be nonnormal. Most statistical tests are based on the "central limit theorem," and with a small sample size the theorem's approximation cannot be relied upon. My advice to postgraduates is to check the distribution of their data for all biological variables, including weight, height, and blood pressure, especially when the sample size is small and subjects are selected nonrandomly. Instead of declaring "most biological variables follow it," I believe it is better to say that "many biological variables follow it when subjects are selected randomly and the sample size is large."
2. The author mentioned that "one of the assumptions of statistical tests used for testing hypotheses is that data are sampled from normal distribution;" though the statement is essentially correct, some explanation is needed to make it unambiguous. Fulfillment of this assumption is needed only for parametric statistical tests and not for all statistical tests.
Fulfillment of this assumption is needed for a "t-test" or "ANOVA" but not for the "Mann-Whitney," "Kruskal-Wallis," "chi-square," or , which can be used to check the distribution of data. I believe that the decision about the distribution of data should be taken after obtaining the results of all methods and also after understanding the distribution of the variable in the population from which the sample was taken.
4. In many places, the author wrote "skewness" in place of "skewed distribution." Readers should understand that checking skewness (shifting of the curve to the left or right) is one component of checking for normal distribution, as mentioned previously. A skewed distribution is a nonnormal distribution.
5. The author mentioned that "once skewness (read "skewed distribution" or "nonnormal distribution") is identified, every attempt should be made to convert it into normal distribution." Here too opinions are divided, and some statisticians believe that instead of making various efforts to transform the data, a nonparametric test can be applied to data of this kind.
6. One more component is ignored in this article, and that is "conversion of data." Articles published in medical journals sometimes convert continuous data into categorical data (ordinal or nominal) by using "cut-off points." For example, blood pressure (ratio) data can be converted into hypertensive and nonhypertensive (nominal data) or into mild, moderate, and severe hypertension (ordinal data). This conversion causes a loss of information, and statistical tests are more sensitive with continuous data.

I thank the author for his interest in and comments [1] on our article. [2] The Postgraduate Corner section hosts a series of small articles on statistics. Many topics are planned in this series, which might be published over 10-15 issues. Only the points relevant to each topic are discussed.
The article commented upon is about data transformation, and a separate one on normal distribution is also planned. Hence all details need not be provided in a single article. More details will appear in articles published later.
The first part of the article tries to explain when one needs to transform data. The objective of this part is not to inform readers which test, parametric or nonparametric, should be selected, so these details are not necessary here.
The author (of the letter) has not supported the first paragraph of point 1 with specific references. Still, to clarify, the statement was written about "biological parameters," and nature includes diverse biology. Examples using human biological parameters were chosen because the readers of this journal belong to this field.
The published article states, "This is called normal distribution as most of the biological parameters (such as weight, height and blood sugar) follow it." The author of the letter has already commented on this line in the first paragraph, and then shortens it to "Most of the biological parameters follow normal distribution" and argues against that. This takes the argument out of context (even though what is written might be correct).
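The letter's remark about small samples and the central limit theorem can be explored with a small simulation. This is only a sketch under stated assumptions: the exponential population, the sample sizes of 5 and 50, the 2000 replications, and the helper names are all choices made here for illustration, not anything taken from the article or the letter. It draws repeated samples from a strongly skewed population and measures how skewed the distribution of sample means remains.

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(x - mean) ** 2 for x in values])
    m3 = statistics.fmean([(x - mean) ** 3 for x in values])
    return m3 / m2 ** 1.5


def sample_mean_skewness(sample_size, replications=2000, seed=0):
    """Skewness of the simulated sampling distribution of the mean.

    Each replication draws `sample_size` values from an exponential
    population (rate 1.0), which is strongly right-skewed.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(replications)
    ]
    return skewness(means)


skew_small = sample_mean_skewness(5)
skew_large = sample_mean_skewness(50)
print(f"skewness of sample means, n=5:  {skew_small:.2f}")
print(f"skewness of sample means, n=50: {skew_large:.2f}")
```

Under these assumptions the sample means are still noticeably skewed at n = 5 and much closer to symmetric at n = 50, which is one way to read the letter's caution about small samples: the theorem itself still holds, but its normal approximation is poorer when n is small.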
Point 2 is acceptable. It is appropriate to specify that only a parametric test needs this assumption.
The published article clearly describes only some simple ways to detect skewness. The intention is to give a few simple methods and not an exhaustive list, as the article is meant for young researchers and not statisticians. Even though visual observation of a histogram or a box and whisker plot is easy, it is unreliable and hence not mentioned. I would like to quote Altman from his article published in the British Medical Journal, "Visual inspection of the distribution may suggest whether the assumption of normality is reasonable but (as Figure 3 suggests) this approach is unreliable." [3] The fact that normality can be checked by a statistical test is already mentioned in the published article.
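As one illustration of checking normality with a statistical test, the sketch below implements the Jarque-Bera statistic using only the Python standard library. The choice of this test is an assumption made here for illustration; neither the article nor the letter names a specific test, and other tests (for example Shapiro-Wilk) are often preferred for small samples. The data sets are invented. The p-value uses the fact that the statistic is asymptotically chi-square with 2 degrees of freedom, for which the upper tail probability is exp(-JB/2); this approximation is rough when the sample is small.

```python
import math
import statistics


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a sample.

    JB = n/6 * (S**2 + (K - 3)**2 / 4), where S is the moment-based
    skewness and K the moment-based kurtosis (3 for a normal curve).
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(x - mean) ** 2 for x in values])
    m3 = statistics.fmean([(x - mean) ** 3 for x in values])
    m4 = statistics.fmean([(x - mean) ** 4 for x in values])
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)  # chi-square (2 df) upper tail
    return jb, p_value


# Invented samples: one right-skewed, one symmetric about zero.
skewed = [1, 1, 2, 2, 3, 4, 4, 6, 9, 16, 25, 36]
symmetric = [-3, -2, -1, -1, 0, 0, 0, 1, 1, 2, 3]

jb_skewed, p_skewed = jarque_bera(skewed)
jb_symmetric, p_symmetric = jarque_bera(symmetric)
print(f"skewed sample:    JB={jb_skewed:.2f}, p={p_skewed:.3f}")
print(f"symmetric sample: JB={jb_symmetric:.2f}, p={p_symmetric:.3f}")
```

A small p-value is evidence against normality. With samples this small the test has little power, so a large p-value should not be read as proof that the data are normal.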
The reason for advising data transformation instead of using a nonparametric test is already highlighted in the article: the parametric tests are more robust. If the data cannot be transformed, then one has to resort to nonparametric tests.
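The choice between transforming the data and using a nonparametric test can be made more concrete with a small sketch. It shows one property relevant to that choice: a rank-based statistic such as the Mann-Whitney U does not change under a monotone transformation like the logarithm, whereas a t statistic does. The data are invented, the function names are hypothetical, and the t statistic shown is the Welch (unequal-variance) form, chosen here only for illustration.

```python
import math
import statistics


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs with a > b, ties counted as 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def welch_t(a, b):
    """Welch two-sample t statistic (no p-value computed here)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


# Invented samples; group_a has one large, outlying value.
group_a = [2, 3, 5, 8, 40]
group_b = [1, 4, 6, 7, 9]

log_a = [math.log(x) for x in group_a]
log_b = [math.log(x) for x in group_b]

u_raw, u_log = mann_whitney_u(group_a, group_b), mann_whitney_u(log_a, log_b)
t_raw, t_log = welch_t(group_a, group_b), welch_t(log_a, log_b)

print(f"Mann-Whitney U: raw={u_raw}, log={u_log}")
print(f"Welch t:        raw={t_raw:.3f}, log={t_log:.3f}")
```

Because the logarithm preserves the ordering of the values, U is the same on both scales, while the t statistic depends on which scale is analyzed. This does not settle which approach is preferable; it only shows that the decision to transform affects a parametric analysis in a way it does not affect a rank-based one.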