
  • How do I choose the best transformation for normalizing data?

    Hi all,

    I have some phenotypic data for which I need to test normality and perform transformation for normalization.

    I'm using Stata v11.0 and the data looks like this.

    use D:\CCMB\Stata11\IMS_All.dta

    sum agelastbday weight standingheight triglycerides totalchol hdl ldl insulin glucosemmol homacalculated mwaist hip whr bmi bodyfat

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
     agelastbday |      6918    40.94681    10.30413         15         76
          weight |      6914    61.33692    12.49082       27.1      116.4
    standinghe~t |      6914    1603.769    88.35717       1322       1890
    triglyceri~s |      6911    1.471775    .7788138        .34      14.88
       totalchol |      6912    4.707168    1.132633   1.576227   10.98191
    -------------+--------------------------------------------------------
             hdl |      6909    1.163221    .2503314   .4392765   2.248062
             ldl |      6861    2.875935     .994252   .1498708   8.630491
         insulin |      6911    7.696543    8.633102         .5        200
     glucosemmol |      6909    5.299472    1.427457      3.108     22.422
    homacalcul~d |      6904    1.871994    2.775801   .0789333   138.9609
    -------------+--------------------------------------------------------
          mwaist |      6909    82.43045    11.74527       49.8     125.05
             hip |      6913    94.45445    9.497728       61.2      142.3
             whr |      6909    .8720521    .0831133   .4865788   1.177805
             bmi |      6914    23.83981    4.496902   12.23988   46.18964
         bodyfat |      6786    26.97245    8.165539   4.479143   46.09452

    sktest agelastbday weight standingheight triglycerides totalchol hdl ldl insulin glucosemmol homacalculated mwaist hip whr bmi bodyfat

                        Skewness/Kurtosis tests for Normality
                                                             ------- joint ------
        Variable |      Obs   Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)  Prob>chi2
    -------------+---------------------------------------------------------------
     agelastbday |  6.9e+03         0.0231         0.0000            .     0.0000
          weight |  6.9e+03         0.0000         0.0406            .     0.0000
    standinghe~t |  6.9e+03         0.9295         0.0000            .     0.0000
    triglyceri~s |  6.9e+03         0.0000         0.0000            .          .
       totalchol |  6.9e+03         0.0000         0.0000            .     0.0000
             hdl |  6.9e+03         0.0000         0.3755            .     0.0000
             ldl |  6.9e+03         0.0000         0.0000            .     0.0000
         insulin |  6.9e+03         0.0000         0.0000            .          .
     glucosemmol |  6.9e+03         0.0000         0.0000            .          .
    homacalcul~d |  6.9e+03         0.0000         0.0000            .          .
          mwaist |  6.9e+03         0.0000         0.0000        47.67     0.0000
             hip |  6.9e+03         0.0000         0.0000            .     0.0000
             whr |  6.9e+03         0.2972         0.0000        54.17     0.0000
             bmi |  6.9e+03         0.0000         0.0000            .     0.0000
         bodyfat |  6.8e+03         0.0000         0.0000            .     0.0000

    ladder bmi

    Transformation      formula                  chi2(2)       P(chi2)
    ------------------------------------------------------------------
    cubic               bmi^3                          .             .
    square              bmi^2                          .             .
    identity            bmi                            .         0.000
    square root         sqrt(bmi)                  70.25         0.000
    log                 log(bmi)                   10.25         0.006
    1/(square root)     1/sqrt(bmi)                70.25         0.000
    inverse             1/bmi                          .         0.000
    1/square            1/(bmi^2)                      .         0.000
    1/cubic             1/(bmi^3)                      .             .


    ladder glucosemmol

    Transformation      formula                  chi2(2)       P(chi2)
    ------------------------------------------------------------------
    cubic               glucos~l^3                     .             .
    square              glucos~l^2                     .             .
    identity            glucos~l                       .             .
    square root         sqrt(glucos~l)                 .             .
    log                 log(glucos~l)                  .             .
    1/(square root)     1/sqrt(glucos~l)               .             .
    inverse             1/glucos~l                     .         0.000
    1/square            1/(glucos~l^2)                 .         0.000
    1/cubic             1/(glucos~l^3)                 .         0.000


    Can anyone help me clear these doubts:

    1) What does it mean when there is a . under adj chi2(2) and Prob>chi2 in the sktest output?
    2) What does it mean when there is a . under chi2(2) and P(chi2) in the ladder output?
    3) How do I pick the right transformation for the data?

    Thanks in advance.

    Regards,
    Priyanka

  • #2
    I don't get past "which I need to test normality and perform transformation for normalization". Why?

    With 6900 or so measurements, such tests are meaningless. Virtually any departure from normality qualifies as significant at conventional levels, regardless of whether it is important. You might be better off looking at skewness and kurtosis directly. The moments command from SSC is a convenience command for this.
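
    As a minimal sketch of looking at the moments directly: tabstat and summarize, detail are official Stata commands that report skewness and kurtosis without attaching a P-value. The variable names are taken from the first post. The /// continuation lines assume the code is run from a do-file, and the moments call assumes its basic syntax is simply moments followed by a variable list.

    ```stata
    * Skewness and kurtosis for every variable, with no significance test attached
    tabstat agelastbday weight standingheight triglycerides totalchol hdl ldl ///
        insulin glucosemmol homacalculated mwaist hip whr bmi bodyfat, ///
        statistics(n mean sd skewness kurtosis) columns(statistics)

    * The same moments for one variable, together with percentiles
    summarize insulin, detail

    * moments is user-written: install it once from SSC, then call it
    * (assumed basic syntax: moments varlist)
    ssc install moments
    moments insulin glucosemmol bmi
    ```

    For reading the results, note that summarize and tabstat report kurtosis on the scale where a normal distribution has kurtosis 3 and skewness 0.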



    • #3
      I just wish to add that Priyanka might also prefer to perform a graphical analysis, with box plots or histograms, for example.
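
      A minimal sketch of such a graphical check, using bmi from the first post as the example variable (any of the other variables could be substituted):

      ```stata
      * Histogram with a normal density overlaid for comparison
      histogram bmi, normal

      * Box plot, which makes skewness and outlying values visible
      graph box bmi

      * Normal quantile plot: points near the reference line suggest near-normality
      qnorm bmi
      ```
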
      Best regards,

      Marcos



      • #4
        Hi,

        I was concerned about whether the data are normally distributed, since the downstream analysis assumes normality of the data.

        I have plotted the data using the gladder and qladder commands. I have attached a few images of the results. These are the plots for the glucosemmol variable.

        IMS_All_FG_q.tif IMS_All_FG_g.tif

        From the figures, I can see that the glucosemmol variable is not normally distributed, but that its inverse transformation looks approximately normal.
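
        A sketch of how the inverse transformation could be inspected on its own, rather than only as one panel of the gladder and qladder displays. The name inv_glucose is a hypothetical variable name chosen here for illustration.

        ```stata
        * Create the inverse-transformed variable (inv_glucose is a hypothetical name)
        generate double inv_glucose = 1/glucosemmol
        label variable inv_glucose "1/glucosemmol"

        * Compare the raw and the transformed variable on normal quantile plots
        qnorm glucosemmol
        qnorm inv_glucose

        * Skewness and kurtosis before and after the transformation
        summarize glucosemmol inv_glucose, detail
        ```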

        Should I rely on the plots alone?

        What does a P(chi2) value of 0.000 in the ladder output mean?







        • #5
          Actually very few methods assume marginal normality.

          The P-values from ladder mean almost nothing for that sample size. The problem with ladder is that it is a shotgun. It tries out all kinds of transformations, most of which often make matters worse.
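
          A rough simulation sketch of the sample-size point, not taken from the thread: it draws 6900 values from a chi-squared distribution with 50 degrees of freedom, whose theoretical skewness is sqrt(8/50) = 0.4, and runs sktest on them. The seed, the choice of distribution and the variable name x are all assumptions made for illustration. With this many observations, even such mild skewness would be expected to give a very small P-value.

          ```stata
          * Keep the data in memory safe; this assumes the code is run from a do-file
          preserve
          clear

          * Simulated sample of about the same size as the real data
          set seed 12345
          set obs 6900

          * Mildly right-skewed variable (x is a hypothetical name)
          generate double x = rchi2(50)

          * Inspect the size of the departure, then see what the test makes of it
          summarize x, detail
          sktest x

          restore
          ```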



          • #6
            One thing caught my attention. Typically, variables such as age, weight, bmi and blood lipids are expected to be - somewhat - "normally" distributed under several conditions, more so under this sample size, "hedged" by the central limit theorem...

            Furthermore, it is the assumption of normality of the residuals, not of the variables themselves, that deserves attention.
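
            A minimal sketch of checking residuals rather than the raw variables. The regression below is purely hypothetical (the thread does not say what the downstream model is); it only reuses variable names from the first post, and resid is a name chosen here for illustration.

            ```stata
            * Hypothetical model, for illustration only
            regress glucosemmol agelastbday bmi

            * Store the residuals (resid is a hypothetical variable name)
            predict double resid, residuals

            * Graphical checks of the residual distribution
            qnorm resid
            histogram resid, normal

            * Residuals against fitted values, to look for structure in the errors
            rvfplot, yline(0)
            ```
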

            Finally, please post commands and output within CODE delimiters, as recommended in the FAQ.
            Last edited by Marcos Almeida; 28 Apr 2016, 09:28.
            Best regards,

            Marcos

