  • Near perfect correlation between dummy variable and outcome variable, all checks on data indicate there is no problem with underlying data.

    I am facing a problem and can't figure out what the issue might be. I am running regressions of average disposable income and employment rates across countries with a minimum wage, with a youth minimum wage, and with neither. The regressions below compare the age group below 25 with the age group 25-40 to see whether the two outcome variables differ between them, and they include fixed effects for country and year.

    Cross posted with Reddit: https://www.reddit.com/r/stata/comme...tiple_outcome/

    Code:
    reg avg_inc_2 inwagedummy youthwagedummy mw_y ymw_y i.*country_n#i.*year i.*age_groups if (age_groups==1|age_groups==2), cluster(country_n)
    reg m_employmentratio inwagedummy youthwagedummy mw_y ymw_y i.*country_n#i.*year i.*age_groups if (age_groups==1|age_groups==2), cluster(country_n)
    Whilst my results for the employment rates seem fine, when I run the same regression for average incomes the results are completely off. I originally thought there might be an outlier problem, but I have run a few tests, including generating a standard error variable, and all observations are within (or just outside) 3 SE of the mean. In particular, my inwagedummy variable seems to be nearly perfectly correlated with the outcome (although a scatter plot looks fine). I have included the regression table outputs in this post, as well as some example data below. If anyone has any idea what I might be doing wrong, or what I am failing to do, it would be greatly appreciated.

    Note: Since I am really interested in comparing avg_income between age_groups, I have repeated this regression with a variable representing the percentage difference from the average disposable income of the 25-40 age group. Although the results are not significant, the collinearity issue disappears when I use this percentage variable.

    Thank you.

    Code:
    Linear regression                               Number of obs     =        942
                                                    F(2, 28)          =          .
                                                    Prob > F          =          .
                                                    R-squared         =     0.8651
                                                    Root MSE          =       11.7
    
                                          (Std. err. adjusted for 29 clusters in country_n)
    ---------------------------------------------------------------------------------------
                          |               Robust
        m_employmentratio | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ----------------------+----------------------------------------------------------------
              inwagedummy |  -24.56775   2.732159    -8.99   0.000    -30.16432   -18.97117
           youthwagedummy |   22.57252   3.581028     6.30   0.000     15.23712    29.90792
                     mw_y |  -14.00416   5.464318    -2.56   0.016    -25.19731   -2.811015
                    ymw_y |  -5.223345   7.480217    -0.70   0.491    -20.54587    10.09919
    Code:
    Linear regression                               Number of obs     =        942
                                                    F(5, 28)          =          .
                                                    Prob > F          =          .
                                                    R-squared         =     0.9960
                                                    Root MSE          =     898.81
    
                                          (Std. err. adjusted for 29 clusters in country_n)
    ---------------------------------------------------------------------------------------
                          |               Robust
                avg_inc_2 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ----------------------+----------------------------------------------------------------
              inwagedummy |  -32052.84    425.568   -75.32   0.000    -32924.58    -31181.1
           youthwagedummy |    1552.72   223.6783     6.94   0.000     1094.536    2010.904
                     mw_y |     770.16   851.1361     0.90   0.373    -973.3132    2513.633
                    ymw_y |   82.27966   934.0052     0.09   0.930    -1830.943    1995.503
                          |
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double country_n float(age_groups year) byte(inwagedummy youthwagedummy) float(mw_y ymw_y) double(m_employmentratio avg_inc_2)
    1 1 1 1 1 0 1       61.905171585    18383.638671875
    1 2 1 1 1 0 0        75.74068471      17202.3203125
    1 1 2 1 1 0 1       61.392848445     19168.96484375
    1 2 2 1 1 0 0  75.91077349666666  18253.72509765625
    1 1 3 1 1 0 1 61.208336079999995    19954.291015625
    1 2 3 1 1 0 0  76.02173251666666   19305.1298828125
    1 1 4 1 1 0 1       61.894535415      20739.6171875
    1 2 4 1 1 0 0  76.63595531666667  20356.53466796875
    2 1 1 0 0 0 0 53.079152300000004         21157.3125
    2 2 1 0 0 0 0        83.91216592         19720.9375
    2 1 2 0 0 0 0        51.85248355       21655.984375
    2 2 2 0 0 0 0  83.66375077999999       20229.453125
    2 1 3 0 0 0 0       51.713445715        22154.65625
    2 2 3 0 0 0 0  84.41059247333334        20737.96875
    2 1 4 0 0 0 0        50.86173013       22653.328125
    2 2 4 0 0 0 0  84.56957792333334       21246.484375
    3 1 1 1 1 0 1 28.895548146499998    18363.091796875
    3 2 1 1 1 0 0  81.70424055333334         18829.4375
    3 1 2 1 1 0 1      29.2099718905  18795.86865234375
    3 2 2 1 1 0 0  80.37262525999999       19314.953125
    3 1 3 1 1 0 1 28.732249623999998   19228.6455078125
    3 2 3 1 1 0 0  80.26854347666666        19800.46875
    3 1 4 1 1 0 1      26.7154846145  19661.42236328125
    3 2 4 1 1 0 0  79.71293698666666       20285.984375
    4 1 1 1 0 1 0 56.314160165000004    20919.876953125
    4 2 1 1 0 0 0  80.44234257333333     21245.01171875
    4 1 2 1 0 1 0        56.36382499    21663.576171875
    4 2 2 1 0 0 0  80.21062512333333    22499.021484375
    4 1 3 1 0 1 0        57.42928177     20234.41796875
    4 2 3 1 0 0 0  80.13730665666667      21533.2734375
    4 1 4 1 0 1 0       58.028286165    19462.677734375
    4 2 4 1 0 0 0  80.83274671666668       20466.453125
    5 1 1 1 0 1 0       35.327390609   5241.46337890625
    5 2 1 1 0 0 0  79.09015721333334     5013.427734375
    5 1 2 1 0 1 0      33.3712065415    5358.9814453125
    5 2 2 1 0 0 0  79.23279219999999  5135.900634765625
    5 1 3 1 0 1 0      31.5679606975   5476.49951171875
    5 2 3 1 0 0 0         79.9887269   5258.37353515625
    5 1 4 1 0 1 0 29.887344158999998   5664.02099609375
    5 2 4 1 0 0 0  79.29896395666667  5440.704833984375
    6 1 1 0 0 0 0  65.42606062499999     21517.95703125
    6 2 1 0 0 0 0  84.11411576333333     23682.38671875
    6 1 2 0 0 0 0 61.762080735000005      22030.3078125
    6 2 2 0 0 0 0        83.97361708 24371.433984375002
    6 1 3 0 0 0 0       63.570820595 22542.658593750002
    6 2 3 0 0 0 0  83.47532549666666        25060.48125
    6 1 4 0 0 0 0        59.55431093       23055.009375
    6 2 4 0 0 0 0  82.35377544999999 25749.528515625003
    7 1 1 1 0 1 0       35.283146985   3940.94287109375
    7 2 1 1 0 0 0        74.69602227    4142.5947265625
    end
    label values country_n country1
    label def country1 1 "Australia", modify
    label def country1 2 "Austria", modify
    label def country1 3 "Belgium", modify
    label def country1 4 "Canada", modify
    label def country1 5 "Czech Republic", modify
    label def country1 6 "Denmark", modify
    label def country1 7 "Estonia", modify
    label values age_groups age_groups_lbl
    label def age_groups_lbl 1 "15-24", modify
    label def age_groups_lbl 2 "26-39", modify
    label values year year_n
    label def year_n 1 "2000", modify
    label def year_n 2 "2001", modify
    label def year_n 3 "2002", modify
    label def year_n 4 "2003", modify
    Last edited by Hugo Cooke; 23 Oct 2021, 13:28.

  • #2
    Note: I originally uploaded the regression tables here, but they are now included in the post itself. I have therefore removed them from this reply to make the page cleaner to read.
    Last edited by Hugo Cooke; 23 Oct 2021, 13:32.



    • #3
      The regression outputs were not, in fact, attached. In any case, it is best not to attach them: better is to copy/paste them directly into the Forum editor between code delimiters (just like -dataex- output).

      In addition, the example data you have shown will probably not be helpful in troubleshooting this because you show data from only one country. Consequently the inwagedummy variable is a constant in the example data, and there is nothing useful that anyone can say about it.

      Please post back with example data that includes several different countries (perhaps just showing three years' worth of data for each of those), and show the regression outputs you are concerned about.

      Added: Crossed with #2. The regression outputs are not visible. But a more workable set of example data is truly needed still: the regression in question, carried out on the example, has all of the variables other than ymw_y omitted due to being constants.
      Last edited by Clyde Schechter; 23 Oct 2021, 13:06.



      • #4
        Cross-posted on Reddit. Please note our policy on cross-posting in the FAQ Advice, which is that you are asked to tell us about it.



        • #5
          Thank you, Clyde. I have made the changes you recommended by editing the post. Hopefully this is what you meant.



          • #6
            Hi Nick, sorry, I will re-read that section of the FAQ. I thought I remembered from my first read that cross-posting across sites was OK, but I will make sure to note it next time. I have also edited the post.
            Last edited by Hugo Cooke; 23 Oct 2021, 13:28.



            • #7
              What's happening here is that you are putting in a large number of country#year terms. In the example data, you have 50 observations, and altogether you have 36 predictor variables in the model. That is massive overfitting and it is not surprising that you are seeing a huge, unrealistic R-squared. Since the number of predictors in this model will be roughly proportional to the number of countries, I imagine that in your full data the same overfitting problem arises.
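
              A rough way to see the scale of the problem is to compare the number of observations with the number of country-by-year cells, since each cell gets its own term in the model. This is only a sketch: it assumes the variable names from the -dataex- example in #1, and cy_tag and n_cells are hypothetical names used just for this check.

              ```stata
              * Sketch: observations versus country-by-year cells in the estimation sample.
              * cy_tag and n_cells are hypothetical names introduced for this check only.
              preserve
              keep if inlist(age_groups, 1, 2)          // same sample as the regressions in #1
              egen byte cy_tag = tag(country_n year)    // 1 for exactly one row per country-year cell
              count if cy_tag
              local n_cells = r(N)
              count
              display "observations: " r(N) "   country-year cells: " `n_cells'
              restore
              ```

              If the two numbers are of the same order of magnitude, the interaction terms alone use up a large share of the information in the data.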

              In fact, although your eye has been caught by the results for inwagedummy, if you remove that variable from the model and rerun it, you get almost identical results. I can't give you an explanation for why the inwagedummy variable, in particular, draws such a large coefficient in this model, but I don't think that's really the important issue anyway. The real problem here is the massive overfitting of the model.
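
              The comparison described above could be set up along these lines. It is a sketch under assumptions: it is meant to be run from a do-file (because of the /// line continuations), it writes the interaction as i.country_n#i.year rather than with the wildcard form used in #1, and with_in and without_in are hypothetical names for the stored estimates.

              ```stata
              * Sketch: fit the income model with and without inwagedummy, then compare.
              * with_in and without_in are hypothetical names for the stored results.
              quietly regress avg_inc_2 inwagedummy youthwagedummy mw_y ymw_y ///
                  i.country_n#i.year i.age_groups if inlist(age_groups, 1, 2), ///
                  vce(cluster country_n)
              estimates store with_in

              quietly regress avg_inc_2 youthwagedummy mw_y ymw_y ///
                  i.country_n#i.year i.age_groups if inlist(age_groups, 1, 2), ///
                  vce(cluster country_n)
              estimates store without_in

              * Show only the policy variables, with standard errors, N and R-squared.
              estimates table with_in without_in, ///
                  keep(inwagedummy youthwagedummy mw_y ymw_y) b(%10.2f) se stats(N r2)
              ```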

              A potential solution is to drop all those interactions, perhaps include year indicators if you think they are really necessary, and handle the country effects by using a random effects model. (You cannot estimate the inwagedummy effect in a country-fixed-effects model because inwagedummy is constant within each country.)
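
              A minimal sketch of that kind of model follows, assuming country_n is the panel identifier and that year indicators are wanted. It is one possible specification, not necessarily the right one for the full data, and it is again written for a do-file.

              ```stata
              * Sketch: random-effects model with year indicators and no interactions.
              xtset country_n
              xtreg avg_inc_2 inwagedummy youthwagedummy mw_y ymw_y i.year i.age_groups ///
                  if inlist(age_groups, 1, 2), re vce(cluster country_n)

              * Check whether inwagedummy varies within countries: a within standard
              * deviation of zero in the -xtsum- output means it is constant in each country.
              xtsum inwagedummy
              ```

              The -xtsum- check matters because a variable with no within-country variation is exactly the kind that a country-fixed-effects model cannot estimate.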

