Unbalanced panel data

Laurence Vedelsdal

Join Date: May 2019
Posts: 9


10 May 2019, 16:31

Hi Statalists

I have an unbalanced panel data set of countries in the period 2002-2015 and I want to explain:
Dep. var: Total Entrepreneurship Activity (tea) across a set of groups.
Groups: Development stages (based on GDP)
Ind. var: Economic freedom (property_rights, government_integrity, tax_burden, government_spenditure, fiscal_health, business_freedom, labor_freedom, monetary_freedom, trade_freedom, investment_freedom, financial_freedom)
Control: Unemployment, potentially GDP with the groups

The prior hypothesis is that the different economic freedom variables vary in how much they boost entrepreneurship, depending on a country's stage of development.
To investigate this hypothesis I have modelled every variable separately to avoid multicollinearity, using fixed effects or random effects. Most Hausman tests point to fe, but some yield a negative test statistic and others produce the following message: (V_b-V_B is not positive definite).
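
The Hausman comparison described above can be sketched as follows. This is only an illustration, assuming the data are already xtset and using property_rights as the example regressor; the stored-estimate names fe and re are arbitrary labels chosen here. The sigmamore option of hausman bases both covariance matrices on the disturbance variance estimate from the efficient estimator, which is often suggested when the default test returns a negative statistic or the "not positive definite" note. Whether it is the right choice for these data is an assumption.

```stata
* Sketch only: assumes the panel is already -xtset- and the variables from the post exist
xtreg tea property_rights unemployment, fe
estimates store fe

xtreg tea property_rights unemployment, re
estimates store re

* Consistent estimator first, efficient estimator second
hausman fe re, sigmamore
```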

I have applied two variations:
1. Including the main effects and the interaction term:
Here, I am comparing development stages 2 and 3, respectively, to the base category (stage 1).

Code:

. xtreg tea c.property_rights##i.stage_of_dev_3 unemployment, fe

Fixed-effects (within) regression               Number of obs     =        571
Group variable: stage_of_~3c                    Number of groups  =          3

R-sq:                                           Obs per group:
     within  = 0.1423                                         min =         88
     between = 0.3803                                         avg =      190.3
     overall = 0.1300                                         max =        278

                                                F(6,562)          =      15.53
corr(u_i, Xb)  = 0.0416                         Prob > F          =     0.0000

--------------------------------------------------------------------------------------------------
                             tea |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------------------+----------------------------------------------------------------
                 property_rights |   1.957176   .8453467     2.32   0.021     .2967506    3.617601
                                 |
                  stage_of_dev_3 |
                              2  |   2.932931   3.202161     0.92   0.360    -3.356734    9.222595
                              3  |  -3.210716   3.449857    -0.93   0.352    -9.986905    3.565473
                                 |
stage_of_dev_3#c.property_rights |
                              2  |  -2.624536   .8740177    -3.00   0.003    -4.341276   -.9077955
                              3  |   -1.46376     .87046    -1.68   0.093    -3.173513    .2459919
                                 |
                    unemployment |  -.2270513   .0404756    -5.61   0.000    -.3065532   -.1475495
                           _cons |   11.64089   3.131846     3.72   0.000      5.48934    17.79245
---------------------------------+----------------------------------------------------------------
                         sigma_u |  4.6860432
                         sigma_e |  5.3212088
                             rho |  .43678413   (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------------
F test that all u_i=0: F(2, 562) = 30.61                     Prob > F = 0.0000

2. Including only the interaction term:
Here, I get a separate effect for each stage, which is more desirable in itself, but I am not aware of the potential flaws of this method.

Code:

. xtreg tea c.property_rights#i.stage_of_dev_3 unemployment, fe

Fixed-effects (within) regression               Number of obs     =        571
Group variable: stage_of_~3c                    Number of groups  =          3

R-sq:                                           Obs per group:
     within  = 0.1219                                         min =         88
     between = 0.1388                                         avg =      190.3
     overall = 0.0757                                         max =        278

                                                F(4,564)          =      19.58
corr(u_i, Xb)  = -0.0542                        Prob > F          =     0.0000

--------------------------------------------------------------------------------------------------
                             tea |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------------------+----------------------------------------------------------------
stage_of_dev_3#c.property_rights |
                              1  |    1.61722   .3570775     4.53   0.000     .9158554    2.318584
                              2  |  -.1800884   .1853185    -0.97   0.332     -.544087    .1839103
                              3  |   .0394411    .148583     0.27   0.791    -.2524025    .3312848
                                 |
                    unemployment |  -.2391922   .0402427    -5.94   0.000     -.318236   -.1601483
                           _cons |   12.29739   1.006851    12.21   0.000     10.31975    14.27502
---------------------------------+----------------------------------------------------------------
                         sigma_u |  5.1240393
                         sigma_e |  5.3743468
                             rho |  .47617108   (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------------
F test that all u_i=0: F(2, 564) = 36.92                     Prob > F = 0.0000

Do you see one superior to the other - or have I overlooked another suitable model for the hypothesis?

Thanks for any advice,
Laurence

Last edited by Laurence Vedelsdal; 10 May 2019, 16:34.


Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#2

10 May 2019, 18:28

The second model is probably mis-specified because it does not include i.stage_of_dev_3 by itself. Consequently, it implicitly constrains the expected value of tea, adjusted for unemployment, to be the same in all groups when property_rights = 0. In other words, it allows for three separate slopes of property_rights, one in each stage of development (so far so good), but it does not allow for differences in levels of tea among the three groups at baseline (probably not good).
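
To make the constraint concrete, the two specifications can be written out as below. This is a sketch in notation assumed here, not taken from the output: $s$ indexes the stage of development, $PR$ is property_rights, $U$ is unemployment, and $\alpha_i$ is the panel fixed effect.

```latex
% Model 1 (main effects plus interaction), with stage 1 as base: \gamma_1 = \delta_1 = 0
\text{tea}_{it} = \beta_0 + \gamma_{s} + (\beta_1 + \delta_{s})\,PR_{it} + \theta\,U_{it} + \alpha_i + \varepsilon_{it}

% Model 2 (interaction only): one slope per stage, no stage-specific level
\text{tea}_{it} = \beta_0 + \beta_{s}\,PR_{it} + \theta\,U_{it} + \alpha_i + \varepsilon_{it}

% Expected tea at PR = 0
\text{Model 1: } \beta_0 + \gamma_{s} + \theta\,U_{it} + \alpha_i
\qquad
\text{Model 2: } \beta_0 + \theta\,U_{it} + \alpha_i
```

Written this way, model 2 is model 1 with the restriction $\gamma_2 = \gamma_3 = 0$, and the stage-specific slopes correspond as $\beta_s = \beta_1 + \delta_s$.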

Unless you believe that constraint is actually true of the real world, you should not use model 2. If you do believe it is true in the real world, is that consistent with the results from model 1, where you see some differences of levels of tea showing up at baseline? Those levels at baseline have very wide confidence intervals, so even though they might be negligibly small, there is also some chance they could be quite large: the data don't seem to give much precision about them.
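
One way to look at the baseline level differences jointly is a Wald test of the stage main effects after the first model. The sketch below assumes the same data and variable names as in #1. Note that a non-significant result would not show the level differences are zero; with confidence intervals this wide it would mostly reflect imprecision.

```stata
* Sketch only: re-run model 1, then jointly test the stage main effects
xtreg tea c.property_rights##i.stage_of_dev_3 unemployment, fe

* Tests that the coefficients on 2.stage_of_dev_3 and 3.stage_of_dev_3 are both zero
testparm i.stage_of_dev_3
```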

You don't give any information about the scale for the property rights variable, but it's really key to understanding whether there are meaningful level differences in tea when property_rights == 0 among the development groups. If property_rights ranges from, say, 0 to 1, then the level differences in tea are of the same order of magnitude as the difference in tea associated with property rights freedom differences at the ends of the scale! You certainly couldn't treat those as negligible. On the other hand if property rights is measured on, say, a scale from 0 to 100 and values observed in the data come close to filling that range, then the level differences are very small compared to the potential differences in tea associated with realistic values of property rights, and could probably be considered negligible, which would justify using the second model.

Added: In the future, please give the threads you start a more informative title. The question you posed has almost nothing to do with the balanced or unbalanced nature of your panel data. You are asking about appropriate terms to include in the modeling of your data. Titles matter. While it is easy to have the illusion that you are engaging in a dialog with some person who responds to your question, that is not the case at all. This is a Forum where many people read along without asking or answering questions--they choose which threads to read based on the titles most of the time. More important, others come here searching specific topics they are interested in learning about. Those searches are focused on the titles. So if somebody else has a modeling issue similar to yours they won't be able to find this, and they will have missed the opportunity to see the solution here. And others who might actually have questions about unbalanced data in panels will have wasted their time coming to this thread.

Last edited by Clyde Schechter; 10 May 2019, 18:35.
Laurence Vedelsdal

Join Date: May 2019

Posts: 9
#3

11 May 2019, 00:47

Hi Clyde, thank you very much for your reply and thoughts. Could I potentially use the first model as the base model and draw the main conclusions and use the latter to indicate the direction of impact of property_rights on tea in each stage?

property_rights and the other economic freedom variables were originally measured on a scale of 0-100, but to make the coefficients larger I have divided these data points by ten, so they are rescaled to 0-10.

Thank you for your point about the title. As I am not set on the above model(s) I thought it was appropriate as it regarded the suitable handling of such data.
I will try to be more precise in the future.
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#4

12 May 2019, 13:13

"Could I potentially use the first model as the base model and draw the main conclusions and use the latter to indicate the direction of impact of property_rights on tea in each stage?"

I wouldn't. It's not the correct model. If you want the developmental-stage-specific marginal effects of property rights without having to do the calculations by hand (or as a series of -lincom- commands), run

Code:

margins stage_of_dev_3, dydx(property_rights)

after the first model.
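
For comparison, the -lincom- route mentioned above would look roughly like the sketch below. It assumes the variable names shown in the output in #1, with stage 1 as the base level of stage_of_dev_3.

```stata
* Sketch only: stage-specific slopes of property_rights from model 1
xtreg tea c.property_rights##i.stage_of_dev_3 unemployment, fe

* Stage 1 (base level): the main-effect coefficient is the slope
lincom property_rights

* Stage 2: main effect plus the stage 2 interaction
lincom property_rights + 2.stage_of_dev_3#c.property_rights

* Stage 3: main effect plus the stage 3 interaction
lincom property_rights + 3.stage_of_dev_3#c.property_rights
```

With the coefficients posted in #1, these work out to roughly 1.96 for stage 1, 1.957 - 2.625 = -0.67 for stage 2, and 1.957 - 1.464 = 0.49 for stage 3. The -margins- command gives the same point estimates along with standard errors in a single table.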

The second model is a different model, one that imposes a constraint that conflicts too much with the results of the first model to use. Given the particular numbers in question, with property rights on a 0-10 scale, I would agree that it's a borderline situation here. The level effects of developmental stage are about 3 in either direction, compared to potential effects of around 20-30 from property rights itself. So the level effects are around 10-15% as big as the maximum property rights effect. That's arguably small, but I think not reasonably considered negligible. But even if you are willing to write off a 15% error as negligible, at best it will confuse your audience to throw numbers from two different models at them. Anyone paying attention and somewhat knowledgeable will spot the inconsistency.
