variable transformation and impact on p-values of estimates.

Salvatore Greco

Join Date: Jun 2022
Posts: 41

06 Aug 2022, 06:22

Hello Statalist Community,
I hope you are well.
I am running a panel data analysis and am trying to work out how to model my predictors properly using square or log transformations. I would like to know whether I am taking the right approach in deciding how to transform a variable to get more accurate estimates.
I present two of my independent variables as examples. The first is NetInterestMargin. Here are some summary results.

Code:

sktest NetInterestMargin

Skewness and kurtosis tests for normality
                                                              ----- Joint test -----
         Variable |       Obs   Pr(skewness)   Pr(kurtosis)   Adj chi2(2)  Prob>chi2
------------------+-----------------------------------------------------------------
NetInterestMargin |     1,199         0.0030         0.0002         20.40     0.0000

And here is further information on the positive skewness.

Code:

tabstat NetInterestMargin, stats (sk)

    Variable |  Skewness
-------------+----------
NetInteres~n |  .2111044
------------------------

The distribution is positively skewed and not normal. I also used the hist command, but I am not able to paste the graph here.
For NetInterestMargin, the log transformation does not seem to help: the distribution is still not normal, and the skewness becomes negative and larger in absolute value.

Code:

 tabstat NetInterestMargin log_NetInterestMargin , stats (sk)

   Stats |  NetInt~n  log_Ne~n
---------+--------------------
Skewness |  .2111044 -2.151288
------------------------------

The log transformation therefore does not seem appropriate. However, ceteris paribus, it can substantially affect whether a variable is statistically significant. Here is an example using the fixed-effects estimator.
In the first attempt I used NetInterestMargin (the first variable in the table), and it is not statistically significant.

Code:

Fixed-effects (within) regression               Number of obs     =      1,197
Group variable: id                              Number of groups  =        109

R-squared:                                      Obs per group:
     Within  = 0.7067                                         min =          9
     Between = 0.0007                                         avg =       11.0
     Overall = 0.2757                                         max =         11

                                                F(21,1067)        =     122.40
corr(u_i, Xb) = -0.0924                         Prob > F          =     0.0000

-------------------------------------------------------------------------------------
       log_NPL_perc | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------------+----------------------------------------------------------------
  NetInterestMargin |  -.0170019   .0359333    -0.47   0.636    -.0875098    .0535061
 AvgEquityAvgAssets |  -.0340388   .0072217    -4.71   0.000    -.0482091   -.0198684
       CosttoIncome |  -.0038549   .0010209    -3.78   0.000     -.005858   -.0018518
               ROAA |  -.1165616   .0466176    -2.50   0.013    -.2080341   -.0250891
                LLP |  -1.21e-07   2.63e-07    -0.46   0.646    -6.36e-07    3.95e-07
             Assets |     26.958   10.27897     2.62   0.009     6.788705    47.12729
     deltabankloans |  -.0233581   .0057626    -4.05   0.000    -.0346655   -.0120507
       deltaFTSEMIB |   .0056594   .0008663     6.53   0.000     .0039596    .0073592
      RealGDPGrowth |   .0743855   .0041371    17.98   0.000     .0662677    .0825033
   deltaNCLDeposits |  -.0564172   .0032417   -17.40   0.000     -.062778   -.0500564
           dummy_25 |  -.1135303   .0980685    -1.16   0.247    -.3059594    .0788987
        dummy_50_75 |   .0063863    .059466     0.11   0.914    -.1102973      .12307
        dummy_25_50 |   -.165802   .0798618    -2.08   0.038    -.3225061    -.009098
       SIZE_25_ROAA |    .001881   .0640499     0.03   0.977    -.1237971    .1275591
       SIZE_50_ROAA |  -.0522251   .0774365    -0.67   0.500    -.2041703      .09972
       SIZE_75_ROAA |  -.0612316   .0615014    -1.00   0.320     -.181909    .0594457
   L1_RealGDPGrowth |   .1579416    .006119    25.81   0.000     .1459349    .1699482
   L2_RealGDPGrowth |    .090434    .005898    15.33   0.000      .078861    .1020071
    L1_deltaFTSEMIB |   .0041794   .0005927     7.05   0.000     .0030165    .0053423
  L1_deltabankloans |    -.06782   .0060265   -11.25   0.000    -.0796451   -.0559949
L1_deltaNCLDeposits |  -.0751297   .0032144   -23.37   0.000     -.081437   -.0688224
              _cons |  -2.004619   .1169712   -17.14   0.000    -2.234139   -1.775099
--------------------+----------------------------------------------------------------
            sigma_u |  .65158704
            sigma_e |   .3224297
                rho |  .80330051   (fraction of variance due to u_i)
-------------------------------------------------------------------------------------
F test that all u_i=0: F(108, 1067) = 26.96                  Prob > F = 0.0000

Here I used the log transformation (the first variable, log_NetInterestMargin), and the variable becomes significant at the 5% level.

Code:

Fixed-effects (within) regression               Number of obs     =      1,197
Group variable: id                              Number of groups  =        109

R-squared:                                      Obs per group:
     Within  = 0.7077                                         min =          9
     Between = 0.0102                                         avg =       11.0
     Overall = 0.2452                                         max =         11

                                                F(21,1067)        =     123.05
corr(u_i, Xb) = -0.1402                         Prob > F          =     0.0000

---------------------------------------------------------------------------------------
         log_NPL_perc | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
----------------------+----------------------------------------------------------------
log_NetInterestMargin |  -.1176943   .0573763    -2.05   0.040    -.2302774   -.0051111
   AvgEquityAvgAssets |  -.0319452   .0071842    -4.45   0.000    -.0460419   -.0178484
         CosttoIncome |  -.0040174   .0010115    -3.97   0.000    -.0060022   -.0020325
                 ROAA |  -.1093108   .0466724    -2.34   0.019    -.2008909   -.0177306
                  LLP |  -8.41e-08   2.63e-07    -0.32   0.749    -5.99e-07    4.31e-07
               Assets |   31.68001   10.45083     3.03   0.002      11.1735    52.18651
       deltabankloans |  -.0262004   .0056822    -4.61   0.000    -.0373499   -.0150508
         deltaFTSEMIB |   .0058295   .0008614     6.77   0.000     .0041394    .0075197
        RealGDPGrowth |   .0737951   .0041224    17.90   0.000     .0657063     .081884
     deltaNCLDeposits |  -.0576963   .0032084   -17.98   0.000    -.0639918   -.0514008
             dummy_25 |  -.1096407   .0978907    -1.12   0.263    -.3017208    .0824395
          dummy_50_75 |   .0115595   .0594111     0.19   0.846    -.1050163    .1281352
          dummy_25_50 |  -.1628232   .0797169    -2.04   0.041    -.3192428   -.0064035
         SIZE_25_ROAA |  -.0033336   .0639285    -0.05   0.958    -.1287733    .1221062
         SIZE_50_ROAA |  -.0571957    .077265    -0.74   0.459    -.2088043    .0944128
         SIZE_75_ROAA |  -.0649643   .0613895    -1.06   0.290    -.1854221    .0554935
     L1_RealGDPGrowth |   .1595236   .0060378    26.42   0.000     .1476762    .1713709
     L2_RealGDPGrowth |   .0894354   .0058843    15.20   0.000     .0778893    .1009814
      L1_deltaFTSEMIB |   .0040979   .0005889     6.96   0.000     .0029423    .0052535
    L1_deltabankloans |  -.0646026   .0057948   -11.15   0.000    -.0759731   -.0532321
  L1_deltaNCLDeposits |  -.0744212    .003186   -23.36   0.000    -.0806728   -.0681697
                _cons |  -1.987116    .101714   -19.54   0.000    -2.186698   -1.787534
----------------------+----------------------------------------------------------------
              sigma_u |  .67220621
              sigma_e |  .32182958
                  rho |  .81352599   (fraction of variance due to u_i)
---------------------------------------------------------------------------------------
F test that all u_i=0: F(108, 1067) = 26.62                  Prob > F = 0.0000

To sum up, what would you recommend I do about the transformation? Do you think this approach is reasonable? What would you do when the log transformation neither makes the distribution of a variable normal nor reduces its skewness, yet turns a non-significant variable into a significant one?

For this other variable, the log transformation seems more appropriate, and the variable is significant in both cases. Therefore, I would say that the log transformation is appropriate here. Do you agree?

Code:

 tabstat log_AvgEquityAvgAssets AvgEquityAvgAssets,stats (sk)

   Stats |  log_Av~s  AvgEqu~s
---------+--------------------
Skewness |  -.346894   1.60862
------------------------------

More generally, do you think that combining hist, tabstat, and sktest with plots of the dependent variable against each independent variable is a valid and appropriate approach, and the right set of tools among those Stata provides?

Thanks everybody who will help me. Greetings to everyone.
Kind Regards,


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

06 Aug 2022, 07:36

Salvatore:
who taught you that your regressand and regressors should follow a normal distribution (outside the textbooks)?
Normality is a weak requirement for the components of the composite error term (u_i and e_it) in panel data regression.
Logging the regressand can sometimes fix issues related to heteroskedasticity; however, as you have 109 panels, you are almost forced to use cluster-robust standard errors, which handle heteroskedasticity and/or autocorrelation of the epsilon term.
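As a rough illustration of the point about cluster-robust standard errors, here is a minimal sketch on simulated data. Everything in it is hypothetical (the variable names y, x, u, e, the seed, and the data-generating process); only the panel dimensions (109 panels, 11 periods) mimic the thread. It fits the same fixed-effects model twice, first with conventional and then with cluster-robust standard errors.

```stata
* Simulated panel: 109 units observed for 11 years (hypothetical data)
clear
set seed 2022
set obs 109
generate int id = _n
generate double u = rnormal()              // unit-specific effect
expand 11
bysort id: generate int year = 2010 + _n
xtset id year

generate double x = rnormal() + 0.5*u      // regressor correlated with u
generate double e = rnormal()*(1 + abs(x)) // heteroskedastic idiosyncratic error
generate double y = 1 + 0.5*x + u + e

* Same model, two variance estimators
xtreg y x, fe
estimates store conventional
xtreg y x, fe vce(cluster id)
estimates store clustered

* Coefficients are identical; only the standard errors differ
estimates table conventional clustered, b(%9.4f) se(%9.4f)
```

The point estimates do not change between the two fits; only the standard errors, and hence the t statistics and p-values, do.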

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

06 Aug 2022, 12:21

To add to @Carlo Lazzaro's point, the logarithmic transformation makes the skewness much worse, as you say. A scatter plot of the logarithmic version against the original may illuminate this; perhaps outliers have been produced by the transformation, i.e. small values become very small.

An introductory text like Jeff Wooldridge's explains that transformations can be useful, but normality of marginal distribution isn't really a goal at all for what you're doing.
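The conjecture that small values become very small on the log scale can be sketched with simulated data. This is only one possible mechanism, not a diagnosis of the data in this thread; the variable names x and log_x, the seed, and the distribution are all assumptions made for illustration.

```stata
* Hypothetical positive variable: right-skewed, plus a few values near zero
clear
set seed 12345
set obs 1000
generate double x = exp(rnormal(0.7, 0.3))
replace x = 0.01 + 0.05*runiform() in 1/20   // 20 very small positive values

generate double log_x = ln(x)

* Compare skewness and the lower tail on both scales
tabstat x log_x, stats(skewness min p1 p50 max)

* All points lie on the log curve; look for points isolated at the bottom left
scatter log_x x
```

On the original scale the small values sit only slightly below the bulk of the data, but after logging they are pulled far out into the left tail, which drives the skewness of log_x negative.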

Salvatore Greco

Join Date: Jun 2022
Posts: 41
#4

07 Aug 2022, 05:30

Good morning Carlo and Nick. Thank you both for your valuable and helpful advice.

To Carlo and Nick:
I knew that normality was not a strong requirement and that textbooks give the wrong impression that following a normal distribution is somehow "crucial". However, my question stemmed from the fact that, after a log transformation, a variable's p-value may change considerably even with robust standard errors. Let me show you.

Code:

xtreg log_NPL_perc NetInterestMargin AvgEquityAvgAssets CosttoIncome ROAA LLP Assets ///
    deltabankloans deltaFTSEMIB RealGDPGrowth deltaNCLDeposits ///
    dummy_25 dummy_50_75 dummy_25_50 SIZE_25_ROAA SIZE_50_ROAA SIZE_75_ROAA ///
    L1_RealGDPGrowth L2_RealGDPGrowth L1_deltaFTSEMIB L1_deltabankloans ///
    L1_deltaNCLDeposits, fe vce(cluster id)

In this case I use simply NetInterestMargin.
And these are the results:

Code:

Fixed-effects (within) regression               Number of obs     =      1,197
Group variable: id                              Number of groups  =        109

R-squared:                                      Obs per group:
     Within  = 0.7067                                         min =          9
     Between = 0.0007                                         avg =       11.0
     Overall = 0.2757                                         max =         11

                                                F(21,108)         =      63.28
corr(u_i, Xb) = -0.0924                         Prob > F          =     0.0000

                                          (Std. err. adjusted for 109 clusters in id)
-------------------------------------------------------------------------------------
                    |               Robust
       log_NPL_perc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------------+----------------------------------------------------------------
  NetInterestMargin |  -.0170019   .0588895    -0.29   0.773     -.133731    .0997273
 AvgEquityAvgAssets |  -.0340388   .0174072    -1.96   0.053    -.0685429    .0004654
       CosttoIncome |  -.0038549   .0017252    -2.23   0.028    -.0072745   -.0004353
               ROAA |  -.1165616   .0576273    -2.02   0.046    -.2307889   -.0023343
                LLP |  -1.21e-07   4.24e-07    -0.28   0.777    -9.61e-07    7.20e-07
             Assets |     26.958   19.26195     1.40   0.165    -11.22252    65.13852
     deltabankloans |  -.0233581   .0064178    -3.64   0.000    -.0360792    -.010637
       deltaFTSEMIB |   .0056594   .0007129     7.94   0.000     .0042463    .0070725
      RealGDPGrowth |   .0743855    .003727    19.96   0.000     .0669979    .0817731
   deltaNCLDeposits |  -.0564172    .005069   -11.13   0.000    -.0664648   -.0463696
           dummy_25 |  -.1135303    .134461    -0.84   0.400    -.3800553    .1529946
        dummy_50_75 |   .0063863   .0857588     0.07   0.941    -.1636025    .1763752
        dummy_25_50 |   -.165802   .1027842    -1.61   0.110    -.3695382    .0379342
       SIZE_25_ROAA |    .001881   .0716246     0.03   0.979    -.1400913    .1438534
       SIZE_50_ROAA |  -.0522251   .0767989    -0.68   0.498    -.2044539    .1000036
       SIZE_75_ROAA |  -.0612316   .0644833    -0.95   0.344    -.1890488    .0665855
   L1_RealGDPGrowth |   .1579416   .0103784    15.22   0.000     .1373698    .1785133
   L2_RealGDPGrowth |    .090434    .005007    18.06   0.000     .0805092    .1003589
    L1_deltaFTSEMIB |   .0041794   .0004197     9.96   0.000     .0033474    .0050113
  L1_deltabankloans |    -.06782   .0058514   -11.59   0.000    -.0794184   -.0562216
L1_deltaNCLDeposits |  -.0751297   .0035868   -20.95   0.000    -.0822394   -.0680201
              _cons |  -2.004619   .2249334    -8.91   0.000    -2.450476   -1.558762
--------------------+----------------------------------------------------------------
            sigma_u |  .65158704
            sigma_e |   .3224297
                rho |  .80330051   (fraction of variance due to u_i)
----------------------------------------------------------------------------------

The p-value is 0.773

In this other case, I used the log transformation.

Code:

xtreg log_NPL_perc log_NetInterestMargin AvgEquityAvgAssets CosttoIncome ROAA LLP Assets ///
    deltabankloans deltaFTSEMIB RealGDPGrowth deltaNCLDeposits ///
    dummy_25 dummy_50_75 dummy_25_50 SIZE_25_ROAA SIZE_50_ROAA SIZE_75_ROAA ///
    L1_RealGDPGrowth L2_RealGDPGrowth L1_deltaFTSEMIB L1_deltabankloans ///
    L1_deltaNCLDeposits, fe vce(cluster id)

And these are the results.

Code:

Fixed-effects (within) regression               Number of obs     =      1,197
Group variable: id                              Number of groups  =        109

R-squared:                                      Obs per group:
     Within  = 0.7077                                         min =          9
     Between = 0.0102                                         avg =       11.0
     Overall = 0.2452                                         max =         11

                                                F(21,108)         =      56.90
corr(u_i, Xb) = -0.1402                         Prob > F          =     0.0000

                                            (Std. err. adjusted for 109 clusters in id)
---------------------------------------------------------------------------------------
                      |               Robust
         log_NPL_perc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
----------------------+----------------------------------------------------------------
log_NetInterestMargin |  -.1176943   .1031952    -1.14   0.257    -.3222451    .0868565
   AvgEquityAvgAssets |  -.0319452   .0179631    -1.78   0.078    -.0675512    .0036609
         CosttoIncome |  -.0040174   .0016765    -2.40   0.018    -.0073406   -.0006942
                 ROAA |  -.1093108   .0557906    -1.96   0.053    -.2198974    .0012758
                  LLP |  -8.41e-08   3.91e-07    -0.21   0.830    -8.60e-07    6.92e-07
               Assets |   31.68001   18.00386     1.76   0.081    -4.006771    67.36678
       deltabankloans |  -.0262004   .0065948    -3.97   0.000    -.0392725   -.0131283
         deltaFTSEMIB |   .0058295   .0006945     8.39   0.000      .004453    .0072061
        RealGDPGrowth |   .0737951   .0037688    19.58   0.000     .0663247    .0812656
     deltaNCLDeposits |  -.0576963    .005105   -11.30   0.000    -.0678152   -.0475774
             dummy_25 |  -.1096407   .1359168    -0.81   0.422    -.3790513      .15977
          dummy_50_75 |   .0115595   .0868522     0.13   0.894    -.1605966    .1837155
          dummy_25_50 |  -.1628232   .1046981    -1.56   0.123     -.370353    .0447067
         SIZE_25_ROAA |  -.0033336   .0725813    -0.05   0.963    -.1472022    .1405351
         SIZE_50_ROAA |  -.0571957     .07599    -0.75   0.453    -.2078211    .0934296
         SIZE_75_ROAA |  -.0649643   .0646736    -1.00   0.317    -.1931587    .0632301
     L1_RealGDPGrowth |   .1595236   .0101262    15.75   0.000     .1394517    .1795954
     L2_RealGDPGrowth |   .0894354   .0051544    17.35   0.000     .0792184    .0996523
      L1_deltaFTSEMIB |   .0040979   .0004228     9.69   0.000       .00326    .0049359
    L1_deltabankloans |  -.0646026   .0058019   -11.13   0.000    -.0761029   -.0531023
  L1_deltaNCLDeposits |  -.0744212   .0036557   -20.36   0.000    -.0816674    -.067175
                _cons |  -1.987116   .1975562   -10.06   0.000    -2.378707   -1.595525
----------------------+----------------------------------------------------------------
              sigma_u |  .67220621
              sigma_e |  .32182958
                  rho |  .81352599   (fraction of variance due to u_i)
---------------------------------------------------------------------------------------

Now the p-value is 0.257.

Although the variable is still not significant in this case, the p-value has changed considerably, and a transformation like this could well make a variable statistically significant.
This, rather than the normal distribution itself, was the point behind my earlier statements. I expressed it badly.
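For reference, the reported p-values can be reproduced by hand from the coefficients and standard errors in the outputs above. The sketch below assumes the usual t reference distribution with degrees of freedom equal to the number of clusters minus one (108, as in the F(21,108) line) for the cluster-robust fits, and 1067 for the conventional fit; the scalar names are hypothetical.

```stata
* Figures copied from the regression outputs in this thread
scalar df_cl   = 109 - 1                   // clusters minus one
scalar t_level = -.0170019/.0588895        // NetInterestMargin, cluster-robust SE
scalar t_log   = -.1176943/.1031952        // log_NetInterestMargin, cluster-robust SE

display "level, clustered: t = " %6.2f t_level "  p = " %5.3f 2*ttail(df_cl, abs(t_level))
display "log,   clustered: t = " %6.2f t_log   "  p = " %5.3f 2*ttail(df_cl, abs(t_log))

* Same log coefficient with the conventional SE from the first post (df = 1067)
scalar t_conv = -.1176943/.0573763
display "log, conventional: t = " %6.2f t_conv "  p = " %5.3f 2*ttail(1067, abs(t_conv))
```

The coefficient on log_NetInterestMargin is the same in both fits; the move from p = 0.040 to p = 0.257 comes entirely from the larger cluster-robust standard error.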

Having said that, and setting aside the reasoning about approximating a normal distribution:

How would you suggest I decide which transformation is sound and appropriate for a given variable? Which commands (tests or graphs) would you recommend?
Perhaps a scatter plot of the regressand against the regressor, or something else?

Thanks a lot for your help.
Kind regards,

Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

07 Aug 2022, 06:05

I've already suggested

Code:

scatter log_NetInterestMargin NetInterestMargin

which may seem futile because, of necessity, all points will lie on the curve defining the relation. But it could help in seeing whether points pop out on the left of the graph. In the same spirit,

Code:

spikeplot NetInterestMargin
spikeplot log_NetInterestMargin

would show the fine structure of each distribution. Beyond that, I am no economist and quite unfit to comment on which predictors deserve inclusion, whether to test a theory or to quantify effects that may exist but won't necessarily rate as significant at conventional levels. Nevertheless, I wonder whether a much simpler model would be more helpful.
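For readers without the original data, the same graph commands can be tried on the auto dataset that ships with Stata. This is only a sketch: price stands in for NetInterestMargin, log_price is a hypothetical variable created here, and the round() bin width is an arbitrary choice.

```stata
* Example dataset installed with Stata; price stands in for the poster's variable
sysuse auto, clear
generate double log_price = ln(price)

* Every point lies on the log curve; isolated points at either end stand out
scatter log_price price

* Fine structure of each distribution
spikeplot price
spikeplot log_price, round(0.05)   // bin to multiples of 0.05 on the log scale

* Skewness before and after the transformation
tabstat price log_price, stats(skewness)
```

Rounding on the log scale is optional; without it, spikeplot draws one spike per distinct value.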
Salvatore Greco

Join Date: Jun 2022

Posts: 41
#7

07 Aug 2022, 08:03

Nick:
thanks for your help.
Since this is the first time I have used scatter to compare two transformations of the same variable, may I ask what you think of the result?
It seems to me that at the bottom left of the graph there are some points disconnected from the others. What would you conclude from this?
You will find the graph as an attachment.
Thanks a lot.
Regards,
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

07 Aug 2022, 16:35

As transformations -- including when they give puzzling results -- are a strong personal interest, I wanted to understand why logging changed the skewness from .211 to -2.151, which is quite a big change. The scatter plot doesn't show any strong outliers on either scale, so the answer must be more subtle, which is why I asked also to see

Code:

spikeplot NetInterestMargin
spikeplot log_NetInterestMargin

It's not that I don't trust the results; I just want to see, and I want you to see, quite why they occur.
Salvatore Greco

Join Date: Jun 2022

Posts: 41
#9

08 Aug 2022, 04:59

Nick:
thanks for your feedback. Attached you will find what you need, as requested.
Thanks for your help.
Regards,
Salvatore
