  • Do I interpret estimator & control variables & R-squared correctly or not?

    Dear Statalist community

    These are my fixed-effects (FE) results for a regression of Government spending on International migration. The six specifications show the regression with the control variables Unemployment, Urban population, Adjusted net income, Population aged >65, and Population aged <14 added in turn.
                                    (1)        (2)        (3)        (4)        (5)        (6)
    International Immigration     0.074    0.160**     0.154*    0.158**   0.166***    0.165**
                                (0.049)    (0.683)    (0.847)    (0.713)    (0.603)    (0.671)
    Observations                    659        657        657        613        613        613
    R-squared (within)            0.141      0.188      0.277      0.314      0.327      0.337
    I would like to know whether I interpret these correctly or not.

    1. The significant result does not hold when the unemployment variable is excluded, as in specification (1). This suggests that unemployment and the international immigration variable are strongly correlated, which may reflect the fact that European unemployment is highly correlated with the number of foreigners entering these countries. Therefore, without the inclusion of unemployment, the estimator is biased. Since the sign of the unemployment coefficient is positive and the correlation between unemployment and immigration is negative, this implies a negative bias in the immigration coefficient.
    2. Since Government spending is measured in % of GDP and International migration is measured in number of people, do I interpret the unit of the estimate as a percentage point? E.g., when the number of immigrants increases by 1 unit (100,000 people), government spending increases by 0.154 percentage points.


    One thing about specification (6) is that 'Population aged <14' is itself not significant, but including it makes the Adjusted net income and Population aged >65 variables significant. I therefore include it to avoid the problem of omitted variable bias. I'm not too sure whether I understand this correctly...

    3. It is worth mentioning that only two control variables are significant in all specifications: the unemployment and urban population variables. I include the other variables (adjusted net income, population aged 65 and above, and population aged 14 and below) because more controls can reduce the chance of omitted variable bias, which can make a truly significant coefficient appear insignificant. The results show that the coefficients of adjusted net income and population aged 65 and above become significant at the 5% level once the full regression is run, so these variables are correlated with immigration and should be included in the regression. However, as the results show, none of these variables is as important as the unemployment variable: it correlates with the immigration variable and changes the immigration coefficient by 0.086 percentage points, whereas the immigration coefficients from specification (3) to (6) change only slightly.
    Lastly, I am not too sure whether I interpret the R-squared correctly.

    4. Even though R-squared increases continuously across the specifications, this does not mean that specification (6), which has the highest R-squared, has the best goodness of fit. The value can be artificially high: for instance, the regression might include too many variables relative to the number of observations, leading to a misleading interpretation of R-squared. It is therefore important to interpret R-squared with care, so I also incorporate into the interpretation whether the standard error of the immigration coefficient is small. Since specification (5) has a reasonably high R-squared compared with the others, the most significant immigration coefficient, and the lowest standard error, I focus on this specification, with the interaction term, for this investigation.
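    (For reference, a rough sketch of the kind of -xtreg, fe- commands that would produce and compare nested specifications like these; the variable and panel names below are placeholders, not my actual data.)
    Code:
    * placeholder names: panel set by country-year; govspend immi unemp urban netinc pop65 pop14
    xtset country year

    xtreg govspend immi, fe
    estimates store m1
    xtreg govspend immi unemp, fe
    estimates store m2
    * ... add urban, netinc, pop65, and pop14 one at a time for m3-m5 ...
    xtreg govspend immi unemp urban netinc pop65 pop14, fe
    estimates store m6

    * coefficients, standard errors, and within R-squared side by side
    estimates table m1 m2 m6, se stats(N r2_w)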
    Thank you
    Guest
    Last edited by sladmin; 02 May 2018, 08:12. Reason: anonymize poster

  • #2
    Guest:
    your post is difficult to follow.
    For the future, please post Stata code and output within CODE delimiters. Thanks.
    However, as far as your last point is concerned, I would look at the adjusted R-squared to make comparisons across different regression models.
    Last edited by sladmin; 02 May 2018, 08:12. Reason: anonymize original poster
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      1. Well, you can't really say that the unemployment and immigration variables are strongly correlated based on this. They must be correlated, for if they were not, the immigration coefficient would not change when unemployment is added. But you cannot infer the strength of the correlation from the amount of change in the immigration coefficient. Even if they are only weakly correlated, the change in the immigration coefficient could still be large if unemployment is also strongly associated with the government spending outcome.

      2. You should modify your wording so that you are not making a causal claim, because this is observational data. And in this instance it is not percentage points. So just say that an increase of 100,000 people in immigration is associated with an increase in spending equivalent to 0.154 percent of GDP.

      3. What is considered "worth mentioning" varies from discipline to discipline. As I am not an economist, what I would recommend might not be well received in your setting and you should check this advice with someone in your discipline. (Or perhaps one of the economists active on the Forum will chime in here.) In my view, the statistical significance of covariates introduced solely for purposes of adjustment and reduction of omitted variable bias should never be mentioned at all. It is about as irrelevant as you can get. Similarly, the criterion for whether to retain variables entered into the model for that purpose should never be based on the results for those variables themselves: they should be based on whether there is a material (not statistically significant but pragmatically material) change in the coefficient of the variables that you are interested in when you add them to the model. But, as I say, traditions vary by discipline.

      4. You are correct that the model with the highest R squared is not necessarily the best. It is possible, even easy, to overfit the noise in the data. In fact, if you just crank up the number of variables until it is equal to the number of observations, you will get R squared = 1, but such a model is completely meaningless and none of its findings would reproduce if you replicated the study. Some people like to rely on the adjusted R squared statistic for this. It is based on R squared, but it is "penalized" for the number of variables included. The notion is that the model with the highest adjusted R squared will be best. Another approach is to use the Akaike or Bayes information criteria (-estat ic- command after -regress-) and select the model with the lowest value.

      Personally, I don't like selecting models based on any single statistic. I generally prefer to explore the relationship between predicted and observed values graphically, or the relationship between predicted values and residuals. Sometimes the model that has the best "average" or "overall" fit to the data works really well for some values of the predictors and poorly for others. But perhaps the values where it fits poorly are more important, or more common, or something like that. So I like to review the fit visually and also focus on which aspects of fit are important. (For example, in my discipline, epidemiology, a model designed to predict the risk of developing prostate cancer could easily be forgiven for performing poorly in men over age 85 because, although prostate cancer is actually very common in that age group, those men will almost all die of other things before the prostate cancer actually causes them any symptoms. If you were looking at a model to predict breast cancer risk in women, it would be very important to have it fit well for women in, say, the 55-70 age range because that is when most breast cancers tend to occur and where treatment makes the most difference in outcomes.)

      I can tell you one thing that you should never do. Never select a model because it gives the most statistically significant result for some variable you are interested in. That's not science. It's cherry picking the data. It is increasingly being viewed as scientific misconduct. Don't do it!
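      A minimal sketch of those checks, assuming a plain -regress- fit and placeholder variable names:
      Code:
      * one candidate specification (placeholder variable names)
      regress govspend immi unemp urban netinc pop65

      * Akaike and Bayes information criteria (lower is better)
      estat ic

      * graphical checks: residuals vs. fitted values, then observed vs. predicted
      rvfplot
      predict yhat, xb
      graph twoway scatter govspend yhat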

      Added: Crossed with #2 which makes some of the same points.



      • #4
        Dear Carlo and Clyde

        Thank you for your suggestion about adjusted R-squared. However, I am sorry that I forgot to mention that by 'R-squared' I mean the 'within R-squared'.
        I am using panel data and the results are from FE estimator.

        In this case, do I still need the adjusted R-squared? Does my explanation make sense?



        • #5
          Dear Clyde

          Thank you for your detailed answers from questions 1 to 4! I appreciate it very much.
          Regarding question 1: if I want to see whether there is a strong or weak correlation between unemployment and immigration, can I use this graph to see the correlation?
          Or are there other ways that are better suited to FE regression?
          Code:
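          * scatter of unemployment against immigration with a fitted regression line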
          graph twoway (scatter unemployment immi) (lfit unemployment immi)


          Attached Files: [screenshot of the scatter plot with fitted line]



          • #6
            Guest:
            you can retrieve the adjusted overall R2 [-e(r2_a)-] (but not the within R2) after -xtreg-, as you can see from the following toy example:
            Code:
            . use http://www.stata-press.com/data/r15/nlswork.dta
            (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
            
            . xtreg ln_wage age wks_ue, fe
            
            Fixed-effects (within) regression               Number of obs     =     22,807
            Group variable: idcode                          Number of groups  =      4,643
            
            R-sq:                                           Obs per group:
                 within  = 0.0825                                         min =          1
                 between = 0.0944                                         avg =        4.9
                 overall = 0.0655                                         max =         14
            
                                                            F(2,18162)        =     816.99
            corr(u_i, Xb)  = 0.0309                         Prob > F          =     0.0000
            
            ------------------------------------------------------------------------------
                 ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     age |    .016458    .000411    40.05   0.000     .0156524    .0172635
                  wks_ue |  -.0018977   .0003395    -5.59   0.000    -.0025632   -.0012323
                   _cons |   1.178033   .0116525   101.10   0.000     1.155193    1.200873
            -------------+----------------------------------------------------------------
                 sigma_u |  .41074881
                 sigma_e |  .31048222
                     rho |  .63638554   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            F test that all u_i=0: F(4642, 18162) = 6.50                 Prob > F = 0.0000
            
            . di e(r2_a)
            -.15205168
            Regression implies that you want to regress the dependent variable onto one or more predictors; correlation does not make this assumption.
            Besides, regression considers the contribution of each predictor (when adjusted to the other predictors) in explaining variation in the conditional mean of the dependent variable.
            Your screenshot (please do not post screenshots, as they are screen space-consuming; post in -graph- format instead) shows heteroskedasticity: hence, the fit does not seem that good.
            Whether or not this result is in line with your expectations, I cannot say.
            Last edited by sladmin; 02 May 2018, 08:12. Reason: anonymize poster
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              Carlo:

              Thank you so much for your answer! That answers what I was looking for.
              One last question: I looked through
              Code:
              help xttest
              for the adjusted R-squared of the random-effects estimator and cannot find it. Do you know how I can find it?

              Thank you
              Guest
              Last edited by sladmin; 02 May 2018, 08:12. Reason: anonymize poster



              • #8
                Guest:
                the stored results for -xtreg, re- do not include the adjusted R-squared.
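                One way to verify this is to refit the toy example from #6 with -re- and list the stored results; a minimal sketch:
                Code:
                * refit the toy example with random effects
                use http://www.stata-press.com/data/r15/nlswork.dta, clear
                xtreg ln_wage age wks_ue, re
                * the within, between, and overall R2 are stored, but there is no e(r2_a)
                ereturn list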
                Last edited by sladmin; 02 May 2018, 08:13. Reason: anonymize original poster
                Kind regards,
                Carlo
                (Stata 19.0)



                • #9
                  Re #5. The graph you show is one way to do that. If you want to know how strongly the two variables are correlated, the most direct approach is to just use the -corr- command. Now, since you are doing -fe- regressions and are interested in within effects, the overall correlation is not really the right statistic. I think the simplest approach is to just do a fixed-effects regression of immigration on unemployment, with no other covariates, and look at those results.
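                  A minimal sketch, reusing the variable names from the -graph- command in #5 and assuming the data are already -xtset-:
                  Code:
                  * overall (pooled) correlation
                  corr unemployment immi

                  * within relationship: fixed-effects regression of immigration on unemployment
                  xtreg immi unemployment, fe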



                  • #10
                    Dear Carlo and Clyde

                    Thank you so much for your answers! I have already considered all the suggestions.

                    2. Since Government spending is measured in % of GDP and International migration is measured in number of people, do I interpret the unit of the estimate as a percentage point? E.g., when the number of immigrants increases by 1 unit (100,000 people), government spending increases by 0.154 percentage points.

                    Back to question 2, which I asked about the interpretation in terms of percentage points... Can you possibly explain why it should be percent (of GDP) instead?
                    I am still stuck, because I believe the effect is that government spending increases from its initial value by 0.154 percentage points, e.g. from 10% to 10.154%.

                    Thank you
                    Guest
                    Last edited by sladmin; 02 May 2018, 08:13. Reason: anonymize poster



                    • #11
                      It's a peculiarity of English usage. The term "percentage point" refers to a difference between dimensionless percentages. But your variable is not a dimensionless percentage. It's "percent of GDP." So when something changes from 10% of GDP to 10.154% of GDP, it has increased by 0.154% of GDP. It's due to that "of GDP" that one doesn't use "percentage point." The purpose of "percentage point" when referring to a change in a percentage is to emphasize that the change is additive, not multiplicative. But the change you are referring to is in fact multiplied, by GDP.
