Estimation sample used in random effects regression, does random effects weight data?

John Adler

Join Date: Apr 2017

Posts: 173
#1

21 Mar 2018, 04:26

I have a conceptual question about random effects regressions. I have panel data on unemployment and health outcomes for the same mothers, analysed at 3 Waves, each five years apart.

Somebody suggested that I build an estimation sample for my analysis that comprises any mothers who appeared in Wave 1 and at least 1 other Wave.

However, I'm finding the logic behind this a bit difficult to understand. The analysis in my paper uses different combinations of Waves: Regression 1 of unemployment on health uses Waves 1, 2 and 3, but Regression 2 of unemployment on health only considers health data measured in Waves 1 and 2, because the health outcomes in this regression weren't measured in Wave 3.

However, the estimation sample for both regressions is the exact same.

To make things clearer, in both regressions above the estimation sample includes any combination of mothers measured in Waves 1, 2 and 3, Waves 1 and 2, and Waves 1 and 3, with the health outcome measures in Regression 2 only recorded in Waves 1 and 2.

So the estimation sample for Regression 2 includes mothers in Waves 1 and 2, but it also includes mothers whose unemployment and health were measured in Wave 1 and whose next measured outcomes and characteristics come from Wave 3, where the health outcome considered in Regression 2 isn't even measured.

Although I was told to use this sample, conceptually it doesn't make any sense to me. How in the world can I include individuals who only have the health outcome I am measuring in Wave 1, and whose next measures were recorded after Wave 2 (the Wave I am treating as my end Wave in Regression 2), in a Wave that doesn't even measure this health outcome?

I was wondering if there is something about random effects regressions that allow us to include people who were only included in one of the time points considered, as in the case above anyone included had a health outcome for at least one time point?

To consider this I re-ran the analysis above with a new estimation sample only for those mothers who appeared in all the Waves analyzed, so Regression 2 changed to only mothers measured in both Waves 1 and 2. The results are very similar so I thought that maybe random effects regressions allowed individuals recorded only once to be mixed in with individuals recorded twice to add something to the analysis, maybe in a weighted way using a similar approach to an inverse probability model?

I attach a screenshot from the paper I'm writing to provide a better explanation of this.

Grateful for any input

To note, this question was also posted here:

https://stackoverflow.com/questions/...level-analysis
Tags: panel data, random effect, regression, syntax, theory
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

22 Mar 2018, 10:12

You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code, readable Stata output, and sample data using dataex. Also, do not post pictures or attach files.

Part of the reason you didn't get a quick answer is that you have a very long, confusing question. How a regression on waves 1 and 2 can have the same number of observations as a regression on waves 1, 2, and 3 is not clear. I don't know what you're doing, but it looks odd. How you use an LPM (linear probability model?) with unemployment rates is likewise unclear.

John Adler

Join Date: Apr 2017

Posts: 173
#3

22 Mar 2018, 12:16

Dear Phil,

Thank you for your response and correct assessment of where my issues lie. I think my query can be condensed into a simple question. Consider a random effects regression across 2 waves like the one below:

Code:

xtreg health_y age_y income_y if wave!=3 & inwave3==0, cluster (current_county_y1) re robust

When respondents are in 1 but not both waves are these respondents included in the random effects regression? Or are they dropped? i.e. do only individuals with 2 waves of data get included in a random effects regression or is any wave for which they have data included? Also is there any way to include individuals who only have one wave and would this even make sense?

Very best,

John

Andrew Musau

Join Date: Oct 2014

Posts: 10260
#4

23 Mar 2018, 02:57

When respondents are in 1 but not both waves are these respondents included in the random effects regression? Or are they dropped? i.e. do only individuals with 2 waves of data get included in a random effects regression or is any wave for which they have data included?

In panel data, you have two sources of variation: within (variation over the T dimension) and between (variation across the N dimension). The fixed effects estimator only considers the within variation in the data and therefore, you need at least 2 observations per individual. The random effects estimator on the other hand considers both the within and between variation, and therefore individuals with only 1 observation do contribute to the estimation (because there is between variation). It is easy to check this as illustrated below:

Code:

. webuse grunfeld

. keep if year> 1951
(170 observations deleted)

. drop if year< 1954 & company> 5
(10 observations deleted)

. list, sepby(company)

     +--------------------------------------------------+
     | company   year   invest   mvalue   kstock   time |
     |--------------------------------------------------|
  1. |       1   1952    891.2   4924.9   1430.5     18 |
  2. |       1   1953   1304.4   6241.7   1777.3     19 |
  3. |       1   1954   1486.7   5593.6   2226.3     20 |
     |--------------------------------------------------|
  4. |       2   1952    645.5   2159.4    444.2     18 |
  5. |       2   1953      641   2031.3    623.6     19 |
  6. |       2   1954    459.3   2115.5    669.7     20 |
     |--------------------------------------------------|
  7. |       3   1952    157.3   2079.7    726.1     18 |
  8. |       3   1953    179.5   2371.6    800.3     19 |
  9. |       3   1954    189.6   2759.9    888.9     20 |
     |--------------------------------------------------|
 10. |       4   1952      145      727    290.6     18 |
 11. |       4   1953   174.93   1001.5    346.1     19 |
 12. |       4   1954   172.49    703.2    414.9     20 |
     |--------------------------------------------------|
 13. |       5   1952     85.4    359.4    729.3     18 |
 14. |       5   1953     91.9    398.4    774.3     19 |
 15. |       5   1954    81.43    365.7    804.9     20 |
     |--------------------------------------------------|
 16. |       6   1954   135.72    927.3    238.7     20 |
     |--------------------------------------------------|
 17. |       7   1954    89.51    192.7    511.3     20 |
     |--------------------------------------------------|
 18. |       8   1954     68.6   1188.9    213.5     20 |
     |--------------------------------------------------|
 19. |       9   1954    49.34    474.5      468     20 |
     |--------------------------------------------------|
 20. |      10   1954     5.12    58.12    14.33     20 |
     +--------------------------------------------------+

Here, companies 6-10 have only 1 observation, so no within variation. We will see that estimating our model including and excluding these observations changes the random effects estimates but not the fixed effects estimates (i.e., random effects coefficients change but not fixed effects coefficients). This is enough to tell you that individuals with only 1 observation matter for the random effects estimator.

Code:

*ALL DATA
xtreg invest mvalue kstock
est store RE1

xtreg invest mvalue kstock, fe
est store FE1


*DROP FIRMS WITH ONLY 1 OBSERVATION
drop if company> 5

xtreg invest mvalue kstock
est store RE2

xtreg invest mvalue kstock, fe
est store FE2

esttab RE1 RE2 FE1 FE2, drop(_cons)

Code:

. esttab RE1 RE2 FE1 FE2, drop(_cons)

----------------------------------------------------------------------------
                      (1)             (2)             (3)             (4)   
                   invest          invest          invest          invest   
----------------------------------------------------------------------------
mvalue              0.127**         0.117*          0.128           0.128   
                   (2.94)          (1.98)          (1.40)          (1.40)   

kstock              0.344**         0.455**         0.499*          0.499*  
                   (2.67)          (3.10)          (3.09)          (3.09)   
----------------------------------------------------------------------------
N                      20              15              20              15   
----------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Dropping the single-observation firms changes the RE estimates (columns 1 and 2) but not the FE estimates (columns 3 and 4).
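To see how much of each kind of variation the trimmed data actually contain, a quick check along the following lines may help. This is only a sketch on the same Grunfeld extract used above; xtdescribe and xtsum are standard Stata commands, and the choice of variables to summarize is an assumption for illustration.

Code:

* Rebuild the unbalanced extract used above
webuse grunfeld, clear
keep if year > 1951
drop if year < 1954 & company > 5

* Participation pattern: five firms with 3 years, five firms with 1 year
xtdescribe

* Split the variation in each variable into between and within components
xtsum invest mvalue kstock

Firms observed only once contribute no within variation, but they still count toward the between component, which is the part of the data the random effects estimator can draw on from them.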


John Adler

Join Date: Apr 2017

Posts: 173
#5

23 Mar 2018, 06:49

Dear Andrew,

Thank you so much for your clear analysis of a complex concept.

Could I ask you to elaborate a little further on how respondents in panel data are contributing to the analysis when they only appear in the first of two waves for example? In simple terms, how is their presence in this first wave adding information to the relationship we are looking at?

For example, in a random effects regression across 2 waves like below:

Code:

xtreg health_y age_y income_y, re

When respondents are in 1 but not both waves, how does their presence in 1 wave contribute to an overall analysis of employment change and health across 2 waves?

Thank you for your input,

John

Andrew Musau

Join Date: Oct 2014

Posts: 10260
#6

23 Mar 2018, 10:47

xtreg health_y age_y income_y, re

So your model is

$$\text{Health}_{it} = \beta_{1}\text{age}_{it} + \beta_{2}\text{income}_{it} + u_{it}\;\;\left(i=1,\cdots, N; \; t=1,\cdots, T\right)$$

where $i$ denotes individuals, $t$ denotes time and $u$ is the error term.

Could I ask you to elaborate a little further on how respondents in panel data are contributing to the analysis when they only appear in the first of two waves for example? In simple terms, how is their presence in this first wave adding information to the relationship we are looking at?

Consider the following:

1) Let's say I compare a low income individual to a higher income individual. I expect that the one with a higher income will be healthier because, on average, higher income individuals are more healthy (i.e., the coefficient $\hat{\beta}_{2}$ is positive).

2) Assume that I compare an individual at the beginning of the sample and the same individual years later. I expect that her health will have deteriorated over time (i.e., $\hat{\beta}_{1}$ is negative).

These are two different types of information: item 1 is cross-sectional information and item 2 is time-series information. I can retrieve information of type 1 from individuals observed only once in the dataset, so in that sense they are useful.

There is a rarely used option, be, in xtreg that considers only the cross-sectional variation in the data. The resulting estimator, referred to as the between-groups estimator, is useful mainly for understanding the random effects model rather than as an estimator in its own right. In Balestra and Nerlove's (1966) two-stage feasible GLS procedure for estimating the random effects model, the first stage involves obtaining the between-groups and within-groups estimators. Likewise, the mixed cross-section and time-series estimator of Theil and Goldberger (1961) is expressed as a weighted average of the between-groups and within-groups estimators, where these are weighted inversely with respect to their variances.
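The weighted-average form can be written out as follows. This is a sketch for a balanced panel with known variance components; the symbols $\hat{\beta}_{W}$, $\hat{\beta}_{B}$, $V_{W}$ and $V_{B}$ are my own labels for the within-groups and between-groups slope estimators and their variance matrices, not notation taken from the cited papers.

$$\hat{\beta}_{RE} = \left(V_{W}^{-1} + V_{B}^{-1}\right)^{-1}\left(V_{W}^{-1}\hat{\beta}_{W} + V_{B}^{-1}\hat{\beta}_{B}\right)$$

Whichever of the two estimators is more precise receives more weight. The same logic carries over to an unbalanced panel, where individuals observed only once have no within variation and so can enter only through the between part.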

References

Balestra, P., Nerlove, M. (1966). Pooling cross-section and time series data in the estimation of a dynamic model: the demand for natural gas. Econometrica 34: 585–612.

Theil, H., Goldberger, A.S. (1961). On pure and mixed statistical estimation in economics. International Economic Review 2(1): 65–78.

John Adler

Join Date: Apr 2017

Posts: 173
#7

23 Mar 2018, 13:16

Dear Andrew,

Thank you for your response. After reading the referenced articles and some online discussions linked below, my understanding is basically that a random effects regression uses respondents who are in some but not all waves of data, along with respondents with full data, to estimate the mean value of the IVs or DVs we might be interested in at each wave. Does that seem reasonable?

I'm surprised that this doesn't present a problem and can only assume that maximum likelihood makes this possible. Would I be correct? Similarly, does the fact that I use a linear probability model for binary health, as above, screw this up at all?

Is it as simple as just including all variables whenever they exist in the data or is there some form of discounting applied under the hood? i.e. maybe some shrinkage estimator is being applied (aka partial pooling) which will push extreme values towards the mean values across the entire sample for groups of observations that are missing a lot of values so that their higher variance doesn't provide misleading results?

Any thoughts you have on this would be a great help,

Best,

John

References:

http://www.theanalysisfactor.com/lin...-post-studies/

The Analysis Factor provides a very interesting example of pre- and post-test scores in this kind of model.

https://stats.stackexchange.com/ques...ed-effect-mode

Delaney, H. D., & Maxwell, S. E. (2004). Designing experiments and analyzing data. London, England. Chapter 15

The above chapter states that:

Random effects models accommodate missing data without having to exclude all participants for whom complete data are not available

A serious limitation of the multivariate approach to repeated measures is that it requires complete data from every participant. Any individual who is missing even a single score must be discarded from the analysis. This obviously lowers statistical power and precision by lowering the sample size. Further, results can be biased unless data are missing completely at random, which essentially means that factors responsible for missingness are entirely unrelated to the dependent variable.

Fortunately, the maximum likelihood mixed model offers advantages over the multivariate approach in both respects. This advantage accrues because the mixed model does not require complete data. Instead, as many observations as are available for any individual are entered into the analysis. For example, when a = 4, some individuals may have been observed at only 1, 2, or 3 occasions. Their scores are simply entered into the analysis along with individuals without missing data. Thus, the mixed model will typically provide greater power and precision than the multivariate approach when some data are missing.

The mixed model approach also requires a less stringent assumption about missingness than does the multivariate approach. Specifically, the mixed model approach assumes that all factors that contribute to missingness are captured by observed scores. We will say more later in the chapter about this assumption, often called "missing at random" as opposed to "missing completely at random."
Andrew Musau

Join Date: Oct 2014

Posts: 10260
#8

24 Mar 2018, 04:27

Let us first consider what parameters you are estimating when running a random effects regression

1. $\beta$

2. The variances of the error components (call them $\sigma^{2}_{u}$ and $\sigma^{2}_{\eta}$)

You need only 1 cross section to identify $\beta$, but you require the panel to identify the variances of the error components. Because $\sigma^{2}_{u}$ and $\sigma^{2}_{\eta}$ are unknown, you need to estimate them, either using a two-stage procedure or jointly by maximum likelihood. I presented the two possibilities in #6, i.e., (i) two-stage FGLS (Balestra and Nerlove), which is the default in xtreg, re, and (ii) the Theil-Goldberger mixed estimator. Maximum likelihood can be implemented using xtreg, mle.

There is nothing special about estimating the model using maximum likelihood because you still have to specify two likelihoods (i.e., the within likelihood and the between likelihood). Separate maximization of these just amounts to within-group (fixed effects) estimation and between-group estimation. So the only takeaway here is that when you are running a random effects regression, under the hood you are running both a fixed effects regression and a between effects regression and then weighting these regressions in some way to obtain the random effects coefficients. You just need some basic matrix algebra to understand the procedure, probably not much more than the derivation of OLS. You will find the two-stage procedure and ML estimation in any of the standard panel data econometrics books.
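One way to see the weighting concretely is the quasi-demeaning form of the GLS estimator. As a sketch, and in my own notation rather than that used above, write the error as $u_{it} = \alpha_{i} + \varepsilon_{it}$ with variances $\sigma^{2}_{\alpha}$ and $\sigma^{2}_{\varepsilon}$, let $T_{i}$ be the number of waves in which individual $i$ is observed, and let $x_{it}$ collect the regressors (including a constant, if the model has one). With known variances, random effects GLS is then equivalent to OLS on the transformed data

$$y_{it} - \theta_{i}\bar{y}_{i} = \left(x_{it} - \theta_{i}\bar{x}_{i}\right)'\beta + \left(u_{it} - \theta_{i}\bar{u}_{i}\right), \qquad \theta_{i} = 1 - \sqrt{\frac{\sigma^{2}_{\varepsilon}}{T_{i}\sigma^{2}_{\alpha} + \sigma^{2}_{\varepsilon}}}$$

With $T_{i} = 1$ the individual mean equals the single observation, so the transformation reduces to multiplying that observation by $1 - \theta_{i}$: the observation is kept but rescaled. As $T_{i}$ grows, $\theta_{i}$ approaches 1 and the transformation approaches the within (fixed effects) one. In practice the variances are replaced by estimates.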

Now, your question on which observations (or individuals) are included is straightforward to answer. After the regression, just run gen sample = e(sample) and this directly gives you that information. The second point is that if your dependent variable is binary, from a specification point of view, you are better advised to look at xtprobit or xtlogit.
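As a minimal sketch of that check, the following uses the Grunfeld extract from the earlier example rather than the original poster's data. The variable names insample, n_used and firm_tag are hypothetical and chosen only for illustration.

Code:

* Rebuild the unbalanced extract from the earlier example
webuse grunfeld, clear
keep if year > 1951
drop if year < 1954 & company > 5

* Random effects fit; the theta option also reports the estimated theta
xtreg invest mvalue kstock, re theta

* Flag the observations that were used in estimation
generate byte insample = e(sample)

* Count used observations per firm, then tabulate one row per firm
bysort company: egen n_used = total(insample)
egen byte firm_tag = tag(company)
tabulate n_used if firm_tag

Firms observed in a single year should show up with n_used equal to 1 rather than being dropped, which is consistent with N = 20 in column 1 of the esttab output shown earlier.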
John Adler

Join Date: Apr 2017

Posts: 173
#9

23 Jul 2018, 12:50

Dear Andrew,

I must apologize, as I never responded to this to thank you for the help you shared with me here, which really contributed to my understanding of this topic.

Very best,

John