
  • Kolmogorov-Smirnov test vs regression to check for differences between groups of data?

    Hi

    I am writing my Master's thesis in international trade / strategies for sourcing intermediate inputs in production.

    Some of what I do relies on comparing productivity across mutually exclusive groups. However, testing whether the differences in productivity are significant is not really straightforward. I have seen two methods used that seem reasonable. The first is to regress productivity on a set of controls (country and sector fixed effects) plus one dummy for each sourcing strategy. This compares the conditional averages, and I can run F-tests to check whether they differ from each other.
    The alternative is to check whether the distribution of productivity across firms differs between groups, for which I can use ksmirnov in Stata. Which would be the better alternative?

    Regression:
          Source |       SS           df       MS      Number of obs   =     2,874
    -------------+----------------------------------   F(26, 2847)     =     25.25
           Model |  132.835205        26  5.10904635   Prob > F        =    0.0000
        Residual |  575.984262     2,847  .202312702   R-squared       =    0.1874
    -------------+----------------------------------   Adj R-squared   =    0.1800
           Total |  708.819467     2,873  .246717531   Root MSE        =    .44979

    ------------------------------------------------------------------------------
         tfp2008 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        Category |
          FO-EU  |  -.0181627   .0234888    -0.77   0.439    -.0642196    .0278941
       FO-NonEU  |   .0118199   .0256205     0.46   0.645    -.0384168    .0620566

    + sector and country dummies.

    Result of F-test
    ( 1) FO-EU - FO-NonEU = 0

    F( 1, 2847) = 2.26
    Prob > F = 0.1332


    K-S test:
     Smaller group        D       P-value
     ---------------------------------------------
     FO-EU:             0.0806     0.001
     FO-NonEU:         -0.0120     0.846
     Combined K-S:      0.0806     0.001

    The K-S test suggests a difference in the distributions, but it does not account for the sector or country fixed effects. Running a regression without these dummies also leads to a significant difference between the two groups. Does anyone have any ideas or thoughts on what is the best thing to do here?
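
    The two approaches described in this post can be sketched roughly as follows. This is a minimal illustration on simulated data, not the actual thesis data: the variable `category`, its 1/2/3 coding, and all generated values are assumptions made only so that the commands run.

    ```stata
    * Simulated stand-in data; variable names, coding and values are hypothetical
    clear
    set seed 12345
    set obs 3000
    generate byte category    = ceil(3 * runiform())   // 1 = base, 2 = FO-EU, 3 = FO-NonEU (assumed)
    generate byte countrycode = ceil(7 * runiform())
    generate byte sector      = ceil(10 * runiform())
    generate double tfp2008   = 0.05 * sector + 0.1 * countrycode + rnormal(0, 0.45)

    * Approach 1: compare conditional means, controlling for country and sector
    regress tfp2008 i.category i.countrycode i.sector
    test 2.category = 3.category

    * Approach 2: two-sample K-S test on the raw distributions (no controls);
    * by() needs exactly two groups, hence the -if- restriction
    ksmirnov tfp2008 if inlist(category, 2, 3), by(category)
    ```

    The regression compares group means conditional on the fixed effects, while -ksmirnov- compares the whole unconditional distributions, so the two can legitimately disagree.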

  • #2
    Hello Jorgen,

    Welcome to the Stata Forum / Statalist.

    Please read the FAQ, particularly on how to present command and output.

    That said, there has been much discussion about whether the Kolmogorov-Smirnov test is the "best" approach (or not!) for evaluating departures from the assumption that a variable is normally distributed. "Classically" speaking, when p < 0.05, we could prefer to apply a non-parametric test, or try to transform the "abnormally distributed" variable.

    That being said, I believe regression analysis should not be treated as a "method" to be compared with the above-mentioned test. Among several reasons, under linear regression it is the distribution of the residuals that matters most.

    Hope that helps.
    Best regards,

    Marcos



    • #3
      I don't really know what I am actually thinking here...

      Comparing these means could (should?) just be done with a t-test? It is just part of some preliminary descriptive statistics, so a more fleshed-out comparison is coming later in the thesis.



      • #4
        In principle, considering you have two groups, yes. Logically, depending on the study question and the study design, you may need to go further and perform more complex estimations.
        Best regards,

        Marcos



        • #5
          Jorgen:
          I would go -regress- (that may well replace -ttest-). That way, you can also consider interacting predictors and looking for turning points, especially if the literature in your research field presents previous examples on those topics.
          As an aside to Marcos' wise recommendation to follow FAQ suggestions about presenting what you typed and what you got from Stata via CODE delimiters, I would advise you to consider -fvvarlist- for creating categorical variables and interactions (see -help fvvarlist-).
          Kind regards,
          Carlo
          (Stata 19.0)
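
          To illustrate why -regress- can stand in for -ttest-, and what factor-variable notation looks like, here is a small sketch using the auto dataset that ships with Stata. The dataset and its variables (price, foreign, weight) are stand-ins chosen only for illustration and have nothing to do with the thesis data.

          ```stata
          sysuse auto, clear

          * Two-sample t-test with pooled (equal) variances
          ttest price, by(foreign)

          * The same comparison as a regression on a group indicator
          regress price i.foreign

          * Factor-variable notation: interaction with a continuous predictor
          regress price i.foreign##c.weight

          * Quadratic term in weight, which allows for a turning point
          regress price i.foreign c.weight##c.weight
          ```

          The t statistic on 1.foreign in the second command equals the one from -ttest- in absolute value; the sign flips because -ttest- reports the first group's mean minus the second's.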



          • #6
            Thanks everyone (and sorry about the output posting). I think I will go back to my original plan, from before I confused myself. It looks something like this (these are the main categories; the ones in the first post are sub-categories, but I am going to use the same method for those):

            Code:
            . regress tfp2008 i.sourcingmode i.countrycode i.sector [aweight = abs_weight] if coresample == 1, robust
            (sum of wgt is   1.3185e+05)
            
            Linear regression                               Number of obs     =      6,609
                                                            F(28, 6580)       =      54.23
                                                            Prob > F          =     0.0000
                                                            R-squared         =     0.2068
                                                            Root MSE          =     .44607
            
            ------------------------------------------------------------------------------
                         |               Robust
                 tfp2008 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            sourcingmode |
                     DI  |   .1153403   .0237628     4.85   0.000     .0687575    .1619231
                     FO  |   .1089231    .013679     7.96   0.000     .0821077    .1357384
                   DIFO  |   .2171586   .0271996     7.98   0.000     .1638385    .2704787
                     FI  |   .1931986   .0379686     5.09   0.000     .1187677    .2676294
                         |
             countrycode |
                    FRA  |  -.1755936   .1306205    -1.34   0.179    -.4316522    .0804651
                    GER  |    .053851   .1315343     0.41   0.682     -.203999    .3117009
                    HUN  |   .2626188   .1388147     1.89   0.059     -.009503    .5347407
                    ITA  |  -.2919814   .1301718    -2.24   0.025    -.5471604   -.0368025
                    SPA  |  -.1927592   .1301991    -1.48   0.139    -.4479916    .0624732
                     UK  |  -.2179037   .1320377    -1.65   0.099    -.4767404     .040933
                         |
                  sector |
                     17  |   .0026936   .0400086     0.07   0.946    -.0757363    .0811235
                     18  |  -.0596794   .0402524    -1.48   0.138    -.1385872    .0192284
                     19  |  -.2716894   .0397468    -6.84   0.000     -.349606   -.1937728
                     20  |  -.1006433   .0319838    -3.15   0.002     -.163342   -.0379447
                     21  |   .2835318   .0516278     5.49   0.000     .1823246     .384739
                     22  |   .1800425   .0343358     5.24   0.000     .1127331    .2473519
                     24  |   .3777267   .0345148    10.94   0.000     .3100664    .4453869
                     25  |   .1984099    .029358     6.76   0.000     .1408586    .2559612
                     26  |   .1100372    .031776     3.46   0.001     .0477459    .1723286
                     27  |   .0390129   .0392466     0.99   0.320    -.0379232    .1159489
                     28  |   .2294105   .0223998    10.24   0.000     .1854996    .2733215
                     29  |   .2213296   .0251081     8.82   0.000     .1721095    .2705496
                     31  |   .2540831    .037686     6.74   0.000     .1802063    .3279599
                     32  |   .7143731   .0684632    10.43   0.000      .580163    .8485832
                     33  |   .3062816   .0380423     8.05   0.000     .2317064    .3808569
                     34  |   .2419606    .048916     4.95   0.000     .1460693    .3378518
                     35  |   .3453713   .0466919     7.40   0.000       .25384    .4369027
                     36  |   -.086332   .0297338    -2.90   0.004    -.1446199   -.0280442
                         |
                   _cons |  -.1472797   .1317102    -1.12   0.264    -.4054745    .1109152
            ------------------------------------------------------------------------------
            
            .
            And then some robustness checks for the results by adding some other controls:
            Code:
            . regress tfp2008 i.sourcingmode logemp age i.countrycode i.sector [aweight = abs_weight] if coresample == 1, robust
            (sum of wgt is   1.3147e+05)
            
            Linear regression                               Number of obs     =      6,593
                                                            F(30, 6562)       =      74.61
                                                            Prob > F          =     0.0000
                                                            R-squared         =     0.2804
                                                            Root MSE          =     .41975
            
            ------------------------------------------------------------------------------
                         |               Robust
                 tfp2008 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            sourcingmode |
                     DI  |   .0650285   .0233922     2.78   0.005     .0191722    .1108847
                     FO  |   .0605353   .0134748     4.49   0.000     .0341203    .0869502
                   DIFO  |   .1242859   .0286734     4.33   0.000     .0680766    .1804951
                     FI  |    .054934   .0318488     1.72   0.085       -.0075     .117368
                         |
                  logemp |   .1571146   .0089853    17.49   0.000     .1395004    .1747288
                     age |  -.0000443   .0002262    -0.20   0.845    -.0004878    .0003992
                         |
             countrycode |
                    FRA  |   .0782132   .1021576     0.77   0.444    -.1220489    .2784754
                    GER  |   .1544359   .1025233     1.51   0.132    -.0465432    .3554149
                    HUN  |   .4180705   .1118077     3.74   0.000      .198891      .63725
                    ITA  |  -.0010206   .1022086    -0.01   0.992    -.2013828    .1993416
                    SPA  |   .0809963   .1020018     0.79   0.427    -.1189605    .2809531
                     UK  |  -.1295069   .1032562    -1.25   0.210    -.3319226    .0729089
                         |
                  sector |
                     17  |   .0254487   .0343189     0.74   0.458    -.0418274    .0927248
                     18  |  -.0431684   .0387586    -1.11   0.265    -.1191479    .0328111
                     19  |  -.2610311    .040445    -6.45   0.000    -.3403165   -.1817457
                     20  |  -.0851981   .0297922    -2.86   0.004    -.1436004   -.0267958
                     21  |    .269844   .0507944     5.31   0.000     .1702705    .3694175
                     22  |   .2050224   .0329785     6.22   0.000     .1403738     .269671
                     24  |   .3341191   .0317628    10.52   0.000     .2718536    .3963845
                     25  |   .1794438   .0278933     6.43   0.000     .1247638    .2341238
                     26  |    .091864   .0305212     3.01   0.003     .0320326    .1516955
                     27  |  -.0103109   .0373028    -0.28   0.782    -.0834366    .0628148
                     28  |   .2342348    .021456    10.92   0.000      .192174    .2762956
                     29  |   .2024201   .0239545     8.45   0.000     .1554613    .2493788
                     31  |   .2317033   .0373618     6.20   0.000      .158462    .3049445
                     32  |   .6763927   .0654307    10.34   0.000     .5481273    .8046581
                     33  |   .2932517   .0363077     8.08   0.000     .2220766    .3644267
                     34  |   .1903422   .0472346     4.03   0.000      .097747    .2829375
                     35  |   .2721237   .0437411     6.22   0.000     .1863769    .3578706
                     36  |  -.0835762   .0303258    -2.76   0.006    -.1430247   -.0241277
                         |
                   _cons |  -.8996312   .1108689    -8.11   0.000     -1.11697   -.6822921
            ------------------------------------------------------------------------------
            So that I have the option to include even more controls later on.
            Last edited by Jorgen Steen; 08 Apr 2017, 02:39.



            • #7
              Jorgen:
              as expected, -regress- seems the way to go.
              I do not think that including more controls is a way to check for the robustness of the first regression model; instead, I would take a look at previous examples reported in the literature of your research field.
              As an aside, you should consider a postestimation investigation of the coefficients you got via, say, -testparm-.
              Kind regards,
              Carlo
              (Stata 19.0)
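
              A postestimation look at the sourcing-mode coefficients might be sketched as below. The data are simulated and the 0-4 coding of sourcingmode is an assumption; only the commands (-testparm-, -test-, -pwcompare-) are the point, and the weights and sample restriction from #6 are left out for brevity.

              ```stata
              * Simulated stand-in data; values and the coding of sourcingmode are hypothetical
              clear
              set seed 2017
              set obs 5000
              generate byte sourcingmode = floor(5 * runiform())   // 0 = base, 1 = DI, 2 = FO, 3 = DIFO, 4 = FI (assumed)
              generate byte countrycode  = ceil(7 * runiform())
              generate byte sector       = ceil(10 * runiform())
              generate double tfp2008    = 0.1 * sourcingmode + 0.03 * sector + rnormal(0, 0.45)

              regress tfp2008 i.sourcingmode i.countrycode i.sector, vce(robust)

              * Joint test that all sourcing-mode coefficients are zero
              testparm i.sourcingmode

              * Test whether two particular modes differ from each other
              test 1.sourcingmode = 2.sourcingmode

              * All pairwise differences, Bonferroni-adjusted for multiple comparisons
              pwcompare sourcingmode, effects mcompare(bonferroni)
              ```

              Each coefficient in the regression table is a contrast with the base category only, so comparisons between two non-base modes need -test- or -pwcompare- rather than a reading of the individual t statistics.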



              • #8
                I absolutely agree with Carlo. Just a small note about the aweights: I gather they are deemed necessary in this field/model. Finally, under the scenario presented in #6, individual t-tests would definitely not provide the results you're looking for, except as preliminary estimates.
                Best regards,

                Marcos
