
  • normality distribution

    Hi. I work with efficiency data that lies between zero and one, and I use the following command to transform it.

    Code:
    gen angular = asin(sqrt(efficiency))

    However, after running the Kolmogorov-Smirnov normality test (in preparation for an independent t-test), the distribution of the transformed data was still not normal. How can I normalize the distribution?
    Thanks

  • #2
    Why do you want to normalize this variable? First, it may or may not be possible. Second, it is almost never necessary. (People often think they need to have a normal distribution in the outcome variable of a regression or ANOVA, but that is not true--it is a very common mistake.) Third, even in situations where at least approximate normality is needed, the use of K-S or other normality tests often, in large samples, rejects samples that are close enough for the purpose, or, in small samples, fails to reject samples that are too non-normal for the purpose.

    So you need to say more about just why you are trying to do this, and also provide more information about the distribution of the variable you have.



    • #3
      Based on the references I have read, one of the default assumptions of the independent t-test is normality of the variables' distributions. And my professor suggests kstest.



      • #4
        Well, in the classical theory of the independent t-test, one assumption is that the outcome variable is normally distributed in each group, but not overall. However, much research has been done about the robustness of this test to departures from normality. Suffice it to say that in large samples, unless the distribution in one or both groups is very highly skewed, the assumption of normality can be dispensed with. The underlying justification is that the t-test is equivalent to a regression on a single dichotomous ("dummy") predictor variable. And viewed in that way, the numerator and denominator of the t-statistic each reduce to a sum of identically distributed random variables, and therefore by the central limit theorem, they are asymptotically normal, even when the underlying distribution of the outcome variable is not. So if your sample is reasonably large, the t-test will give correct results anyway. Just how large the sample has to be depends on how far from normality the outcome variable distribution is, but except for very unusual distributions, typical research data samples with hundreds of observations are sufficient. For only mildly non-normal distributions, even a dozen or so observations may suffice to rescue the t-statistic from non-normality.
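        The claimed equivalence of the t-test and a regression on a dummy variable is easy to verify empirically. A minimal sketch using Stata's bundled auto dataset (any outcome and any binary group variable would do):

        Code:
        * the two-sample t-test and a regression on a single dummy
        * give the same |t| statistic (the sign depends on coding)
        sysuse auto, clear
        ttest mpg, by(foreign)
        regress mpg i.foreign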



        It is worth noting that not all distributions can be normalized. To take a very obvious example, if you had a distribution that is just a spike at a single value, no transformation will change the shape of the distribution--all you can do is change its location. A general purpose trick for normalizing those distributions that can be normalized is to

        Code:
        sort y
        gen y2 = invnormal((_n - 0.5)/_N)  // the 0.5 offset avoids invnormal(1), which is missing
        y2 will then have (approximately) a standard normal distribution.

        But this may involve considerable distortion of the original data. For example, if the original distribution of y is bimodal, the resulting normalized distribution is quite deformed compared to the original data and inferences drawn based on it are no better than those you would get using non-parametric statistics instead of a t-test. And, of course, doing this separately in two subsets will necessarily result in the means being 0 in both groups, so that a ttest contrasting them becomes useless.

        Again, while I think this is not something you should really do, if for purposes of exploration and learning you want to (or have to) go through this exercise, you can try a series of different transformations including powers, reciprocal, and logarithm to see if any of these get you the desired result. There is a Stata command -ladder- that simplifies this process for you. See -help ladder-.
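        For instance (a sketch, with y standing in for your outcome variable):

        Code:
        ladder y      // chi-squared normality tests across the ladder of powers
        gladder y     // the same idea, shown as a matrix of histograms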



        • #5
          Parastou, I think Clyde's answer in #4 addresses your question, but he also gave an excellent reply to someone else asking about the assumption of normality here.

          Originally posted by Clyde Schechter View Post
          First let's be clear what assumption of normality we're talking about here. People are often under the impression that the predictor variables or the outcome variable need to have a normal distribution--this is not even remotely true. There is not, and never has been, any such restriction on any kind of regression (panel or otherwise). Where normality sometimes comes into play is that when the residuals of the regression are normally distributed, it can be easily proved that the coefficients divided by their standard errors have t-distributions, and you can do hypothesis testing based on those t-statistics.

          So normality of residuals is a sufficient condition for correct inference based on t- or z- statistics. It is not, however, always necessary. If the sample size is large, then it can be shown using the central limit theorem that the sampling distributions of the coefficients and their standard errors are (asymptotically) normal and chi square (respectively) anyway, so that the t-/z- statistics are again correct.

          So normality of residuals is only a concern in small samples. Even there, if the residual distribution is not too far from normal (especially if it is symmetric) then the t- and z- statistics' sampling distributions are reasonably well approximated by the corresponding t- and z- distributions so that hypothesis testing using them will have nearly the nominal Type I error rates.

          Finally, I want to emphasize that even when all of those "rescues" fail and non-normality is a problem (i.e. small sample with a nasty residual distribution), it only affects the validity of p-values. Even in this worst case scenario, it remains true that the estimated coefficients are unbiased (ordinary least squares regression) or consistent (fixed-effects panel regression) estimators of the population-level coefficients, and the standard errors are good estimates of the standard deviation of the sampling distribution. So if p-values are not important to answering your research question, you don't have to think about normality in any circumstance.
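          If you do want to inspect residual normality in practice, a minimal sketch (the variable names y, x1, x2 are hypothetical):

          Code:
          regress y x1 x2
          predict resid, residuals
          qnorm resid           // normal quantile plot of the residuals
          swilk resid           // Shapiro-Wilk test, if a formal test is wanted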



          • #6
            Thanks Clyde & David. I have 28 observations across the two groups, and the kstest result was non-normal. Following Clyde's answer, I then ran the test within each group: the distribution for the first group was normal, but not for the second. I tried all the transformations, but the distribution in the second group remained non-normal.
            Last edited by parastou amirifar; 02 Jan 2019, 01:49.



            • #7
              I tried the 1/cubic (reciprocal cube) transformation. It works, but it changes the original data substantially and completely reverses their order. Is this a valid method of data transformation? It is the only transformation that produces a normal distribution of the data.



              • #8
                Parastou, given that you have only 56 observations (if I understood correctly), perhaps you should share the data with us (group code and dependent variable only, in case there are confidentiality issues). See Point 12 in the FAQ regarding the use of -dataex-. HTH.
                --
                Bruce Weaver
                Email: bweaver@lakeheadu.ca
                Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                Stata version: 16.0 IC (Windows)



                • #9
                  The reciprocal cube necessarily inverts order. It's hard to see that it's a good choice for data on (0, 1], as presumably your data are. (Guessing wildly, efficiency zero is not observed but efficiency one may be possible.)

                  Also, only a value of 1 equals its own reciprocal cube, so almost every transformed value differs from the original; but no transformation at all would be valid if the criterion were leaving the data unchanged!

                  A transformation that strong may be chosen on dubious grounds, e.g. to accommodate moderate outliers.

                  Why not tell us more about the data, giving them as a data example?



                  • #10
                    Here is my data:

                    Code:
                    unit   efficiency   angularefficiency
                       1        0.846            1.008438
                       1        0.972            1.333597
                       1        1                1.570796
                       1        0.976            1.351267
                       1        0.817            0.956189
                       1        0.928            1.189008
                       1        0.85             1.015985
                       1        1                1.570796
                       1        0.962            1.294235
                       1        1                1.570796
                       1        0.859            1.033313
                       1        1                1.570796
                       1        0.872            1.059273
                       1        0.966            1.309284
                       2        0.916            1.157994
                       2        1                1.570796
                       2        0.961            1.290596
                       2        0.932            1.199892
                       2        1                1.570796
                       2        0.976            1.351267
                       2        1                1.570796
                       2        0.948            1.246892
                       2        1                1.570796
                       2        1                1.570796
                       2        0.961            1.290596
                       2        1                1.570796
                       2        1                1.570796
                       2        1                1.570796


                    The angularefficiency variable was created with the following command.

                    Code:
                    generate angularefficiency = asin(efficiency)

                    With the Kolmogorov-Smirnov test, the distribution of the first group is normal and that of the second group is non-normal. Could this be an artifact of the normality test used? I read somewhere that when the number of observations is under 50, Shapiro-Wilk is the better test, and kstest is more suitable for larger samples.
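                    For reference, the Shapiro-Wilk test is available in Stata as -swilk- and can be run within each group, e.g.:

                    Code:
                    by unit, sort: swilk efficiency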
                    Last edited by parastou amirifar; 02 Jan 2019, 08:36.



                    • #11
                      I'd just use a permutation test here, which obviates normality and similar assumptions. You can just do:
                      Code:
                      permute unit r(t), reps(10000): ttest efficiency, by(unit)
                      for which I obtain a two-tailed p = 0.029, vs. p = 0.028 for the conventional t-test.

                      Or, if you want to experience the aesthetic pleasure <grin> of examining all of the permutations, rather than a sample, and can stand to wait for 30 minutes or so to analyze all 28 choose 14 combinations, you could use the community-contributed program -tsrtest- (-findit tsrtest-), of which I'm an author:
                      Code:
                      tsrtest unit r(t), exact: ttest efficiency, by(unit)



                      • #12
                        Thank you for posting the data. This is indeed a small sample, and the empirical distributions are very non-normal indeed. For unit 2, the distribution shows a very strong bunching up close to the upper limit, as if there is a ceiling effect. By contrast, the distribution for unit 1 is a bit more broadly spread, and perhaps has a second mode near the low end. In any case, the apparent distribution in unit 1 is, if anything, something like an upside down normal curve, with a gap rather than a peak in the middle. I also observe that these variables show pretty substantial heteroskedasticity: the variance ratio for unit = 1 to unit = 2 is greater than 5. This is another strike against using the Student t-test. Of course, with only 14 observations in each of these, the appearances could be quite misleading as to what goes on in the larger world. But we can probably all agree here that the use of a Student t-test on the efficiency variable is not appropriate.

                        The problem with proceeding is that these distributions cannot be normalized without doing real violence to the nature of the data. And any test based on a transformation that accomplished normalization would be essentially incapable of being meaningfully translated back to the original metric of the data. I really would not go there at all. I would use a non-parametric test such as the Wilcoxon rank-sum (equivalent to Mann-Whitney U) here; quantile regression of the medians on a unit indicator, or bootstrap resampling of the difference of means (or medians), might also be appropriate.
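                        Sketches of those alternatives in Stata (the bootstrap expression assumes -ttest- has just stored the group means in r(mu_1) and r(mu_2)):

                        Code:
                        ranksum efficiency, by(unit)      // Wilcoxon rank-sum / Mann-Whitney
                        qreg efficiency i.unit            // median regression on a unit indicator
                        bootstrap diff=(r(mu_2)-r(mu_1)), reps(2000): ///
                            ttest efficiency, by(unit)    // bootstrap the difference of means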



                        • #13
                          Thanks for the data.

                          Getting a normal distribution out of these data is more than usually difficult.

                          1. They are bounded, and the bounds bite in that several values are reported as (exactly) 1. (The normal distribution is not bounded.)

                          2. There are spikes and no transformation worthy of the name will do anything but map spikes to spikes.

                          3. Over the narrow observed range powers (including reciprocal powers) will only be slightly nonlinear, as witness this graph for the reciprocal cube.
                          (Note reversal of the y axis, to correct the effect that surprised you.)

                          Code:
                          local labels 0.8 "0.8" 0.85 "0.85" 0.9 "0.9" 0.95 "0.95" 1 "1"
                          
                          
                          gen trans = 1/(eff^3)
                          scatter trans eff , ysc(reverse) ytitle(1/efficiency{sup:3}) yla(, ang(h)) ms(Oh) xla(`labels') name(G0, replace)
                          [Figure: efficiency0.png — scatter of 1/efficiency³ against efficiency, y axis reversed]



                          That transformation really isn't worthwhile, as the complexity of interpretation (unless you can produce some esoteric reason why a reciprocal cube makes sense) isn't justified by the slight change in the distribution.

                          There are other transformations possible for bounded responses. Plain or vanilla logit is ruled out here because logit(1) is indeterminate. Something like a folded root or cube root is possible, yet for these data such transformations just widen the gap between values of 1 and values less than 1, and thus don't seem helpful.

                          Verdict: transformation here is futile. Need that be a problem? We can still

                          (1) plot the data

                          (2) try various tests, proceeding even more carefully than usual. It seems convenient to flip the units around, so that unit 2 comes first in the comparison.

                          Code:
                          gen Unit = 3 - unit
                          label def Unit 1 "Unit 2" 2 "Unit 1"
                          label val Unit Unit
                          What happens with a t test?

                          Code:
                          . ttest eff, by(Unit)
                          
                          Two-sample t test with equal variances
                          ------------------------------------------------------------------------------
                             Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
                          ---------+--------------------------------------------------------------------
                            unit 2 |      14    .9781429    .0078844    .0295006    .9611097     .995176
                            unit 1 |      14        .932    .0182139    .0681503    .8926512    .9713488
                          ---------+--------------------------------------------------------------------
                          combined |      28    .9550714    .0107026    .0566326    .9331116    .9770313
                          ---------+--------------------------------------------------------------------
                              diff |            .0461429    .0198472                .0053464    .0869393
                          ------------------------------------------------------------------------------
                              diff = mean(unit 2) - mean(unit 1)                            t =   2.3249
                          Ho: diff = 0                                     degrees of freedom =       26
                          
                              Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
                           Pr(T < t) = 0.9859         Pr(|T| > |t|) = 0.0282          Pr(T > t) = 0.0141
                          Looks good, but we're relying on robustness.

                          How about Wilcoxon-Mann-Whitney?

                          Code:
                          . ranksum efficiency, by(Unit) porder
                          
                          Two-sample Wilcoxon rank-sum (Mann-Whitney) test
                          
                                  Unit |      obs    rank sum    expected
                          -------------+---------------------------------
                                unit 2 |       14       239.5         203
                                unit 1 |       14       166.5         203
                          -------------+---------------------------------
                              combined |       28         406         406
                          
                          unadjusted variance      473.67
                          adjustment for ties      -37.33
                                               ----------
                          adjusted variance        436.33
                          
                          Ho: effici~y(Unit==unit 2) = effici~y(Unit==unit 1)
                          z = 1.747
                          Prob > |z| = 0.0806
                          
                          P{effici~y(Unit==unit 2) > effici~y(Unit==unit 1)} = 0.686
                          See what I did there? I asked for a really interesting descriptive summary, the probability that unit 2 has higher efficiency than unit 1. (Probability 0.5 would mean equal efficiencies.)

                          The P-value isn't so good, which doesn't seem surprising. The ranks can only understate the much lower values for some of the observations in unit 1.

                          What about graphs? I here show quantile plots (using qplot from the Stata Journal) and spike plots, just two of several possible "honest" graphs for these data.
                          With the quantile plots, a normal distribution would plot as linear on the chosen scale, which isn't true at all.

                          Code:
                          local labels 0.8 "0.8" 0.85 "0.85" 0.9 "0.9" 0.95 "0.95" 1 "1"
                          
                          qplot efficiency, over(unit) trscale(invnormal(@)) yla(`labels', ang(h))  name(G1, replace)
                          
                          spikeplot efficiency, by(unit)  subtitle(, fcolor(none)) ///
                          yla(, ang(h)) xla(`labels') name(G2, replace)
                          [Figure: efficiency1.png — normal quantile plots of efficiency by unit]

                          [Figure: efficiency2.png — spike plots of efficiency by unit]




                          It is easy to think that unit 2 does better. You just need to add major reservations when reporting a t test, because its assumptions aren't well satisfied.

                          EDIT: This took a while drafting and Mike and Clyde posted while I was doing that. I would not say that we agree exactly, but the overlaps in advice are considerable.
                          Last edited by Nick Cox; 02 Jan 2019, 10:06.



                          • #14
                            Thanks for posting your data, Parastou. To make things just a bit easier for forum members, you could have issued this command (with your dataset open):

                            Code:
                            dataex unit efficiency
                            The output from that command is Stata code that others can use to generate your dataset. Here is that code:

                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input byte unit float efficiency
                            1 .846
                            1 .972
                            1    1
                            1 .976
                            1 .817
                            1 .928
                            1  .85
                            1    1
                            1 .962
                            1    1
                            1 .859
                            1    1
                            1 .872
                            1 .966
                            2 .916
                            2    1
                            2 .961
                            2 .932
                            2    1
                            2 .976
                            2    1
                            2 .948
                            2    1
                            2    1
                            2 .961
                            2    1
                            2    1
                            2    1
                            end
                            Re the dataex command, note the following information from the FAQ:

                            As from Stata 15.1 (and 14.2 from 19 December 2017), dataex is included with the official Stata distribution. Users of Stata 15 (or 14) must update to benefit from this.
                            HTH.



                            • #15
                              What do others think about using fracreg for a problem like this (help fracreg)? Like other regression models, it no doubt works better when the sample size is a bit larger. But are there any other issues to consider?

                              Code:
                              generate byte g2 = unit==2 // indicator for Unit 2 membership
                              fracreg logit efficiency g2
                              fracreg probit efficiency g2
                              Code:
                              . fracreg logit efficiency g2
                              
                              Iteration 0:   log pseudolikelihood = -10.303321  
                              Iteration 1:   log pseudolikelihood = -5.0219476  
                              Iteration 2:   log pseudolikelihood =  -4.952518  
                              Iteration 3:   log pseudolikelihood = -4.9506243  
                              Iteration 4:   log pseudolikelihood = -4.9506225  
                              Iteration 5:   log pseudolikelihood = -4.9506225  
                              
                              Fractional logistic regression                  Number of obs     =         28
                                                                              Wald chi2(1)      =       6.65
                                                                              Prob > chi2       =     0.0099
                              Log pseudolikelihood = -4.9506225               Pseudo R2         =     0.0354
                              
                              ------------------------------------------------------------------------------
                                           |               Robust
                                efficiency |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                              -------------+----------------------------------------------------------------
                                        g2 |   1.183303   .4588036     2.58   0.010     .2840645    2.082542
                                     _cons |   2.617825   .2820224     9.28   0.000     2.065071    3.170579
                              ------------------------------------------------------------------------------
                              
                              . fracreg probit efficiency g2
                              
                              Iteration 0:   log pseudolikelihood = -7.1825212  
                              Iteration 1:   log pseudolikelihood = -4.9755712  
                              Iteration 2:   log pseudolikelihood = -4.9507237  
                              Iteration 3:   log pseudolikelihood = -4.9506225  
                              Iteration 4:   log pseudolikelihood = -4.9506225  
                              
                              Fractional probit regression                    Number of obs     =         28
                                                                              Wald chi2(1)      =       6.83
                                                                              Prob > chi2       =     0.0090
                              Log pseudolikelihood = -4.9506225               Pseudo R2         =     0.0354
                              
                              ------------------------------------------------------------------------------
                                           |               Robust
                                efficiency |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                              -------------+----------------------------------------------------------------
                                        g2 |   .5259669   .2012482     2.61   0.009     .1315277    .9204061
                                     _cons |   1.490853   .1361255    10.95   0.000     1.224052    1.757654
                              ------------------------------------------------------------------------------


