  • variable transformation and impact on p-values of estimates.

    Hello Statalist Community,
    I hope you are well.
    I am running a panel data analysis and trying to work out how to model my predictors properly using square or log transformations. I would like to know whether I am taking the right approach to choosing a transformation that gives more accurate estimates.
    Here is an example using two of my independent variables. The first is NetInterestMargin. Some summary statistics follow.
    Code:
    sktest NetInterestMargin
    
    Skewness and kurtosis tests for normality
                                                                  ----- Joint test -----
             Variable |       Obs   Pr(skewness)   Pr(kurtosis)   Adj chi2(2)  Prob>chi2
    ------------------+-----------------------------------------------------------------
    NetInterestMargin |     1,199         0.0030         0.0002         20.40     0.0000
    Here is further information on the positive skewness.
    Code:
    tabstat NetInterestMargin, stats (sk)
    
        Variable |  Skewness
    -------------+----------
    NetInteres~n |  .2111044
    ------------------------
    The skewness is positive and sktest rejects normality. I also used the hist command, but I am not able to paste the graph here.
    For NetInterestMargin, the log transformation does not seem to help: the distribution is still not normal, and the skewness becomes negative and larger in magnitude.
    Code:
     tabstat NetInterestMargin log_NetInterestMargin , stats (sk)
    
       Stats |  NetInt~n  log_Ne~n
    ---------+--------------------
    Skewness |  .2111044 -2.151288
    ------------------------------
    The log transformation does not seem appropriate. However, ceteris paribus, it can substantially affect whether a variable is statistically significant, as the following fixed-effects regressions show.
    In the first attempt I used NetInterestMargin (the first variable in the table), and it is not statistically significant.
    Code:
    Fixed-effects (within) regression               Number of obs     =      1,197
    Group variable: id                              Number of groups  =        109
    
    R-squared:                                      Obs per group:
         Within  = 0.7067                                         min =          9
         Between = 0.0007                                         avg =       11.0
         Overall = 0.2757                                         max =         11
    
                                                    F(21,1067)        =     122.40
    corr(u_i, Xb) = -0.0924                         Prob > F          =     0.0000
    
    -------------------------------------------------------------------------------------
           log_NPL_perc | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    --------------------+----------------------------------------------------------------
      NetInterestMargin |  -.0170019   .0359333    -0.47   0.636    -.0875098    .0535061
     AvgEquityAvgAssets |  -.0340388   .0072217    -4.71   0.000    -.0482091   -.0198684
           CosttoIncome |  -.0038549   .0010209    -3.78   0.000     -.005858   -.0018518
                   ROAA |  -.1165616   .0466176    -2.50   0.013    -.2080341   -.0250891
                    LLP |  -1.21e-07   2.63e-07    -0.46   0.646    -6.36e-07    3.95e-07
                 Assets |     26.958   10.27897     2.62   0.009     6.788705    47.12729
         deltabankloans |  -.0233581   .0057626    -4.05   0.000    -.0346655   -.0120507
           deltaFTSEMIB |   .0056594   .0008663     6.53   0.000     .0039596    .0073592
          RealGDPGrowth |   .0743855   .0041371    17.98   0.000     .0662677    .0825033
       deltaNCLDeposits |  -.0564172   .0032417   -17.40   0.000     -.062778   -.0500564
               dummy_25 |  -.1135303   .0980685    -1.16   0.247    -.3059594    .0788987
            dummy_50_75 |   .0063863    .059466     0.11   0.914    -.1102973      .12307
            dummy_25_50 |   -.165802   .0798618    -2.08   0.038    -.3225061    -.009098
           SIZE_25_ROAA |    .001881   .0640499     0.03   0.977    -.1237971    .1275591
           SIZE_50_ROAA |  -.0522251   .0774365    -0.67   0.500    -.2041703      .09972
           SIZE_75_ROAA |  -.0612316   .0615014    -1.00   0.320     -.181909    .0594457
       L1_RealGDPGrowth |   .1579416    .006119    25.81   0.000     .1459349    .1699482
       L2_RealGDPGrowth |    .090434    .005898    15.33   0.000      .078861    .1020071
        L1_deltaFTSEMIB |   .0041794   .0005927     7.05   0.000     .0030165    .0053423
      L1_deltabankloans |    -.06782   .0060265   -11.25   0.000    -.0796451   -.0559949
    L1_deltaNCLDeposits |  -.0751297   .0032144   -23.37   0.000     -.081437   -.0688224
                  _cons |  -2.004619   .1169712   -17.14   0.000    -2.234139   -1.775099
    --------------------+----------------------------------------------------------------
                sigma_u |  .65158704
                sigma_e |   .3224297
                    rho |  .80330051   (fraction of variance due to u_i)
    -------------------------------------------------------------------------------------
    F test that all u_i=0: F(108, 1067) = 26.96                  Prob > F = 0.0000
    Here I used the log transformation (the first variable, log_NetInterestMargin).

    Code:
    Fixed-effects (within) regression               Number of obs     =      1,197
    Group variable: id                              Number of groups  =        109
    
    R-squared:                                      Obs per group:
         Within  = 0.7077                                         min =          9
         Between = 0.0102                                         avg =       11.0
         Overall = 0.2452                                         max =         11
    
                                                    F(21,1067)        =     123.05
    corr(u_i, Xb) = -0.1402                         Prob > F          =     0.0000
    
    ---------------------------------------------------------------------------------------
             log_NPL_perc | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    ----------------------+----------------------------------------------------------------
    log_NetInterestMargin |  -.1176943   .0573763    -2.05   0.040    -.2302774   -.0051111
       AvgEquityAvgAssets |  -.0319452   .0071842    -4.45   0.000    -.0460419   -.0178484
             CosttoIncome |  -.0040174   .0010115    -3.97   0.000    -.0060022   -.0020325
                     ROAA |  -.1093108   .0466724    -2.34   0.019    -.2008909   -.0177306
                      LLP |  -8.41e-08   2.63e-07    -0.32   0.749    -5.99e-07    4.31e-07
                   Assets |   31.68001   10.45083     3.03   0.002      11.1735    52.18651
           deltabankloans |  -.0262004   .0056822    -4.61   0.000    -.0373499   -.0150508
             deltaFTSEMIB |   .0058295   .0008614     6.77   0.000     .0041394    .0075197
            RealGDPGrowth |   .0737951   .0041224    17.90   0.000     .0657063     .081884
         deltaNCLDeposits |  -.0576963   .0032084   -17.98   0.000    -.0639918   -.0514008
                 dummy_25 |  -.1096407   .0978907    -1.12   0.263    -.3017208    .0824395
              dummy_50_75 |   .0115595   .0594111     0.19   0.846    -.1050163    .1281352
              dummy_25_50 |  -.1628232   .0797169    -2.04   0.041    -.3192428   -.0064035
             SIZE_25_ROAA |  -.0033336   .0639285    -0.05   0.958    -.1287733    .1221062
             SIZE_50_ROAA |  -.0571957    .077265    -0.74   0.459    -.2088043    .0944128
             SIZE_75_ROAA |  -.0649643   .0613895    -1.06   0.290    -.1854221    .0554935
         L1_RealGDPGrowth |   .1595236   .0060378    26.42   0.000     .1476762    .1713709
         L2_RealGDPGrowth |   .0894354   .0058843    15.20   0.000     .0778893    .1009814
          L1_deltaFTSEMIB |   .0040979   .0005889     6.96   0.000     .0029423    .0052535
        L1_deltabankloans |  -.0646026   .0057948   -11.15   0.000    -.0759731   -.0532321
      L1_deltaNCLDeposits |  -.0744212    .003186   -23.36   0.000    -.0806728   -.0681697
                    _cons |  -1.987116    .101714   -19.54   0.000    -2.186698   -1.787534
    ----------------------+----------------------------------------------------------------
                  sigma_u |  .67220621
                  sigma_e |  .32182958
                      rho |  .81352599   (fraction of variance due to u_i)
    ---------------------------------------------------------------------------------------
    F test that all u_i=0: F(108, 1067) = 26.62                  Prob > F = 0.0000
    To sum up, what would you recommend I do about the transformation? Do you think this approach is reasonable? And what would you do when the log transformation neither makes a variable's distribution normal nor removes the skewness, yet turns a non-significant variable into a significant one?

    For this other variable, the log transformation seems more appropriate, and the variable is significant in both cases. I would therefore say that the log transformation is appropriate. Do you agree?
    Code:
     tabstat log_AvgEquityAvgAssets AvgEquityAvgAssets,stats (sk)
    
       Stats |  log_Av~s  AvgEqu~s
    ---------+--------------------
    Skewness |  -.346894   1.60862
    ------------------------------
    More generally, do you think that combining hist, tabstat, and sktest with plots of the dependent variable against each independent variable is a valid and appropriate approach, and the right combination of tools in Stata?

    Thanks to everybody who can help. Greetings to everyone.
    Kind Regards,

  • #2
    Salvatore:
    who taught you that your regressand and regressors should follow a normal distribution (outside the textbooks)?
    Normality is a weak requirement for the elements of the composite error term (u_i and e_it) in panel data regression.
    Logging the regressand can sometimes fix issues related to heteroskedasticity; however, as you have 109 panels, you are almost forced to use cluster-robust standard errors, which handle heteroskedasticity and/or autocorrelation of the epsilon term.
    Kind regards,
    Carlo
    (Stata 19.0)
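
    A minimal sketch of the comparison Carlo describes, not taken from the thread: it fits the same fixed-effects model with default and with cluster-robust standard errors and shows them side by side. The nlswork example dataset and the variables ln_wage, age, and tenure are stand-ins chosen only so that the code runs; they are assumptions, not Salvatore's data.

    ```stata
    * Illustrative data only: the nlswork example dataset (requires internet access)
    webuse nlswork, clear
    xtset idcode year

    * Same model twice: default versus cluster-robust standard errors
    xtreg ln_wage age tenure, fe
    estimates store fe_default

    xtreg ln_wage age tenure, fe vce(cluster idcode)
    estimates store fe_cluster

    * Coefficients, standard errors, and p-values side by side
    estimates table fe_default fe_cluster, b(%9.4f) se(%9.4f) p(%9.3f)
    ```

    The coefficients are the same in both columns; only the standard errors, and hence the p-values, differ. The two versions of Salvatore's own regression show the same pattern.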


    • #3
      To add to @Carlo Lazzaro's point, the logarithmic transformation makes the skewness much worse, as you say. This can be illuminated by a scatter plot of the logarithmic version against the original. Perhaps outliers have been produced by the transformation, i.e. small values become very small (large and negative on the log scale).

      An introductory text like Jeff Wooldridge's explains that transformations can be useful, but normality of marginal distribution isn't really a goal at all for what you're doing.
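
      One way to see the mechanism, sketched on simulated data rather than Salvatore's: the gamma draws below are an assumption, chosen only because they are positive, right-skewed, and include values fairly close to zero. The variable names x and log_x are hypothetical.

      ```stata
      * Simulated illustration only; nothing here comes from the thread's data
      clear
      set seed 12345
      set obs 1199

      * Positive, right-skewed variable (gamma with shape 3, scale 1)
      generate double x = rgamma(3, 1)
      generate double log_x = ln(x)

      * Compare skewness and the tails on the two scales
      tabstat x log_x, statistics(skewness min p1 p50 p99 max)

      * All points lie on the log curve; look for points isolated at the lower left
      scatter log_x x
      ```

      In a draw like this the raw variable should show positive skewness and the logged one negative skewness, because the logarithm stretches the gap between the smallest values and the median while compressing the upper tail. Whether that is what happens in the real data is a separate question.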


      • #4
        Good morning Carlo and Nick. Thank you both for your valuable and helpful advice.

        To Carlo and Nick:
        I knew normality was not a strong requirement and that textbooks give the wrong impression that following a normal distribution is somehow "crucial". However, my question stemmed from the fact that after a log transformation a variable's p-value may change considerably, even with robust standard errors. Let me show you.
        Code:
        xtreg log_NPL_perc NetInterestMargin AvgEquityAvgAssets CosttoIncome ROAA LLP Assets ///
            deltabankloans deltaFTSEMIB RealGDPGrowth deltaNCLDeposits ///
            dummy_25 dummy_50_75 dummy_25_50 SIZE_25_ROAA SIZE_50_ROAA SIZE_75_ROAA ///
            L1_RealGDPGrowth L2_RealGDPGrowth L1_deltaFTSEMIB L1_deltabankloans ///
            L1_deltaNCLDeposits, fe vce(cluster id)
        In this case I use simply NetInterestMargin.
        And these are the results:
        Code:
        Fixed-effects (within) regression               Number of obs     =      1,197
        Group variable: id                              Number of groups  =        109
        
        R-squared:                                      Obs per group:
             Within  = 0.7067                                         min =          9
             Between = 0.0007                                         avg =       11.0
             Overall = 0.2757                                         max =         11
        
                                                        F(21,108)         =      63.28
        corr(u_i, Xb) = -0.0924                         Prob > F          =     0.0000
        
                                                  (Std. err. adjusted for 109 clusters in id)
        -------------------------------------------------------------------------------------
                            |               Robust
               log_NPL_perc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        --------------------+----------------------------------------------------------------
          NetInterestMargin |  -.0170019   .0588895    -0.29   0.773     -.133731    .0997273
         AvgEquityAvgAssets |  -.0340388   .0174072    -1.96   0.053    -.0685429    .0004654
               CosttoIncome |  -.0038549   .0017252    -2.23   0.028    -.0072745   -.0004353
                       ROAA |  -.1165616   .0576273    -2.02   0.046    -.2307889   -.0023343
                        LLP |  -1.21e-07   4.24e-07    -0.28   0.777    -9.61e-07    7.20e-07
                     Assets |     26.958   19.26195     1.40   0.165    -11.22252    65.13852
             deltabankloans |  -.0233581   .0064178    -3.64   0.000    -.0360792    -.010637
               deltaFTSEMIB |   .0056594   .0007129     7.94   0.000     .0042463    .0070725
              RealGDPGrowth |   .0743855    .003727    19.96   0.000     .0669979    .0817731
           deltaNCLDeposits |  -.0564172    .005069   -11.13   0.000    -.0664648   -.0463696
                   dummy_25 |  -.1135303    .134461    -0.84   0.400    -.3800553    .1529946
                dummy_50_75 |   .0063863   .0857588     0.07   0.941    -.1636025    .1763752
                dummy_25_50 |   -.165802   .1027842    -1.61   0.110    -.3695382    .0379342
               SIZE_25_ROAA |    .001881   .0716246     0.03   0.979    -.1400913    .1438534
               SIZE_50_ROAA |  -.0522251   .0767989    -0.68   0.498    -.2044539    .1000036
               SIZE_75_ROAA |  -.0612316   .0644833    -0.95   0.344    -.1890488    .0665855
           L1_RealGDPGrowth |   .1579416   .0103784    15.22   0.000     .1373698    .1785133
           L2_RealGDPGrowth |    .090434    .005007    18.06   0.000     .0805092    .1003589
            L1_deltaFTSEMIB |   .0041794   .0004197     9.96   0.000     .0033474    .0050113
          L1_deltabankloans |    -.06782   .0058514   -11.59   0.000    -.0794184   -.0562216
        L1_deltaNCLDeposits |  -.0751297   .0035868   -20.95   0.000    -.0822394   -.0680201
                      _cons |  -2.004619   .2249334    -8.91   0.000    -2.450476   -1.558762
        --------------------+----------------------------------------------------------------
                    sigma_u |  .65158704
                    sigma_e |   .3224297
                        rho |  .80330051   (fraction of variance due to u_i)
        ----------------------------------------------------------------------------------
        The p-value of NetInterestMargin is 0.773.

        In this other case, I used the log transformation.
        Code:
        xtreg log_NPL_perc log_NetInterestMargin AvgEquityAvgAssets CosttoIncome ROAA LLP Assets ///
            deltabankloans deltaFTSEMIB RealGDPGrowth deltaNCLDeposits ///
            dummy_25 dummy_50_75 dummy_25_50 SIZE_25_ROAA SIZE_50_ROAA SIZE_75_ROAA ///
            L1_RealGDPGrowth L2_RealGDPGrowth L1_deltaFTSEMIB L1_deltabankloans ///
            L1_deltaNCLDeposits, fe vce(cluster id)
        And these are the results.

        Code:
        Fixed-effects (within) regression               Number of obs     =      1,197
        Group variable: id                              Number of groups  =        109
        
        R-squared:                                      Obs per group:
             Within  = 0.7077                                         min =          9
             Between = 0.0102                                         avg =       11.0
             Overall = 0.2452                                         max =         11
        
                                                        F(21,108)         =      56.90
        corr(u_i, Xb) = -0.1402                         Prob > F          =     0.0000
        
                                                    (Std. err. adjusted for 109 clusters in id)
        ---------------------------------------------------------------------------------------
                              |               Robust
                 log_NPL_perc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        ----------------------+----------------------------------------------------------------
        log_NetInterestMargin |  -.1176943   .1031952    -1.14   0.257    -.3222451    .0868565
           AvgEquityAvgAssets |  -.0319452   .0179631    -1.78   0.078    -.0675512    .0036609
                 CosttoIncome |  -.0040174   .0016765    -2.40   0.018    -.0073406   -.0006942
                         ROAA |  -.1093108   .0557906    -1.96   0.053    -.2198974    .0012758
                          LLP |  -8.41e-08   3.91e-07    -0.21   0.830    -8.60e-07    6.92e-07
                       Assets |   31.68001   18.00386     1.76   0.081    -4.006771    67.36678
               deltabankloans |  -.0262004   .0065948    -3.97   0.000    -.0392725   -.0131283
                 deltaFTSEMIB |   .0058295   .0006945     8.39   0.000      .004453    .0072061
                RealGDPGrowth |   .0737951   .0037688    19.58   0.000     .0663247    .0812656
             deltaNCLDeposits |  -.0576963    .005105   -11.30   0.000    -.0678152   -.0475774
                     dummy_25 |  -.1096407   .1359168    -0.81   0.422    -.3790513      .15977
                  dummy_50_75 |   .0115595   .0868522     0.13   0.894    -.1605966    .1837155
                  dummy_25_50 |  -.1628232   .1046981    -1.56   0.123     -.370353    .0447067
                 SIZE_25_ROAA |  -.0033336   .0725813    -0.05   0.963    -.1472022    .1405351
                 SIZE_50_ROAA |  -.0571957     .07599    -0.75   0.453    -.2078211    .0934296
                 SIZE_75_ROAA |  -.0649643   .0646736    -1.00   0.317    -.1931587    .0632301
             L1_RealGDPGrowth |   .1595236   .0101262    15.75   0.000     .1394517    .1795954
             L2_RealGDPGrowth |   .0894354   .0051544    17.35   0.000     .0792184    .0996523
              L1_deltaFTSEMIB |   .0040979   .0004228     9.69   0.000       .00326    .0049359
            L1_deltabankloans |  -.0646026   .0058019   -11.13   0.000    -.0761029   -.0531023
          L1_deltaNCLDeposits |  -.0744212   .0036557   -20.36   0.000    -.0816674    -.067175
                        _cons |  -1.987116   .1975562   -10.06   0.000    -2.378707   -1.595525
        ----------------------+----------------------------------------------------------------
                      sigma_u |  .67220621
                      sigma_e |  .32182958
                          rho |  .81352599   (fraction of variance due to u_i)
        ---------------------------------------------------------------------------------------
        
        .
        Now the p-value is 0.257.

        Although the variable is still not significant, its p-value has changed considerably, and a transformation like this could well make a variable statistically significant.
        This, rather than the normal distribution itself, was the point behind my earlier statements. I expressed it badly.

        Having said that, and setting aside the question of normality:

        How would you suggest I identify a sound and appropriate transformation for a variable? Which commands (tests or graphs) would you recommend?
        Perhaps a scatter plot of the regressand against the regressor, or something else?


        Thanks a lot for your help.
        Kind regards,


        • #5
          I've already suggested

          Code:
          scatter log_NetInterestMargin NetInterestMargin
          which may seem futile because, of necessity, all points will lie on the curve defining the relation. But it could help to show whether points pop out on the left of the graph. In the same spirit
          Code:
          spikeplot NetInterestMargin
          spikeplot log_NetInterestMargin
          would show the fine structure of each distribution. Beyond that, I am no economist and quite unfit to comment on which predictors deserve inclusion, whether to test a theory or to quantify effects that may exist but won't necessarily rate as significant at conventional levels. Nevertheless, I wonder whether a much simpler model would be more helpful.


          • #6
            Nick:
            thanks for your help.
            Since this is the first time I have used scatter to compare two transformations of the same variable, may I ask what you think of the result?
            It seems to me that at the bottom left of the graph there are some points disconnected from the others. What would you conclude from this?
            You will find the graph as an attachment.
            Thanks a lot.
            Regards,
            Attached Files

              • #8
                As transformations -- including when they give puzzling results -- are a strong personal interest, I wanted to understand why logging changed the skewness from .211 to -2.151, which is quite a big change. The scatter plot doesn't show any strong outliers on either scale, so the answer must be more subtle, which is why I asked also to see

                Code:
                spikeplot NetInterestMargin 
                
                spikeplot log_NetInterestMargin 
                It's not that I don't trust the results; I just want to see, and I want you to see, quite why they occur.
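
                A numeric complement to the spike plots, sketched under the assumption that the thread's dataset is in memory with the variable names used above (id, NetInterestMargin, log_NetInterestMargin). It cannot be run without that data.

                ```stata
                * Assumes Salvatore's dataset is already loaded; variable names as in the thread
                * Percentiles, skewness, and the smallest and largest values on both scales
                summarize NetInterestMargin log_NetInterestMargin, detail

                * Zero or negative values have no logarithm and would become missing
                count if NetInterestMargin <= 0

                * List the observations below the 1st percentile on both scales
                quietly summarize NetInterestMargin, detail
                list id NetInterestMargin log_NetInterestMargin if NetInterestMargin < r(p1)
                ```

                If the listed values are small positive numbers, their logarithms will sit far below the rest of the distribution, which is one candidate explanation for a strongly negative skewness after logging.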


                • #9
                  Nick:
                  thanks for your feedback. Attached you will find what you need, as requested.
                  Thanks for your help.
                  Regards,
                  Salvatore

                  Attached Files
