Running regression with panel data but missing values of y

maartenc

Join Date: May 2014

Posts: 10
#1

Running regression with panel data but missing values of y

14 May 2014, 02:41

I have a question about my regression on monthly data of 500 companies which I defined as panel data sorted by gvkey and date. From this data I would like to run a regression on variable x, where x is the sum of three lags of the variable count. For each month I have a variable count, but the variable y is only available 4 times per year (quaterly).

The x var (qcount) is created by the following formula:
gen qcount = l.count + l2.count + l3.count

From some tests I assume that Stata excludes all observations with a missing value of x. This is indeed what I want, but what Stata also does, is excluding all observations of x if y has a missing value. As a result, qcount will not include all the lags I asked for.
reg y qcount

Example of my data:
gvkey time y x
123 1 . 12
123 2 . 16
123 3 3 10
123 4 . 14

It seems to look like Stata drops the observations at time 1,2,4 and therefore excludes the x variables as well.
Does anyone have a solution to exclude the missing y variables while not excluding the x variables in this regression?

Thanks in advance

Best, Maarten

Last edited by maartenc; 14 May 2014, 02:59.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 36053
#2

14 May 2014, 04:07

Why are you surprised at this? Stata can't do anything here with missing values, which includes lagged values that are missing.

Whether there is scope to fill in your missings with imputation (highly fashionable but overrated) or interpolation (centuries-old and underrated) is a different question.

Last edited by Nick Cox; 14 May 2014, 04:21.
Comment
maartenc

Join Date: May 2014

Posts: 10
#3

14 May 2014, 04:31

I am not surprised and understand that this is the most convenient way for many people. However in my case it does not work out unfortunately and I don't know if there is another method to solve this problem. Hopefully someone can help me.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#4

14 May 2014, 04:43

Convenience is not the issue; what exactly do you expect that Stata might do instead given missing values?
Comment
daniel klein

Join Date: Mar 2014

Posts: 3911
#5

14 May 2014, 04:53

First let's make sure we are talking about the same thing(s) here. When you say

Stata drops the observations at time 1,2,4 and therefore excludes the x variables as well.

the fist part of this is true - Stata drops the observations. The second part, however, might be misleading, because Stat does not drop variables. The x values are used for any observation that has both non-missing x and non-missing y.

You then ask about

a solution to exclude the missing y variables while not excluding the x variables in this regression?

I assume by variables you mean values (or observation with missing values on a particular variable). Nick partly answered this question. You might think about imputing the missing values using MI, interpolation (yes, now I might be blamed for sloppy terminology as interpolation might not be regarded as imputation), or any other method.

There is not a way to relate (x) values to missing (y) values, and I wonder how you figure this should be possible? Maybe we do not get, what you are trying to do, and you might want to add more information and elaborate on what exactly you mean by

in my case it does not work out

Best
Daniel
Comment
maartenc

Join Date: May 2014

Posts: 10
#6

14 May 2014, 05:49

Thanks for the replies.

Imputation or interpolation is not really what I am looking for since I do not want to fill these missing y values.

Hopefully I can make my question more clear:

1) With x variables I mean observations, sorry for this. As you mention Stata does indeed the following: The x values are used for any observation that has both non-missing x and non-missing y. However I want to use the x values for any observation that has non-missing x, but also for missing y.

2) I want to include all observations of x even when the y value of this observation is missing. The reason for this is that I want to use the cumulative lagged value of x (gen qcount = l.count + l2.count + l3.count)

I hoped that it was possible to add an option in the regression formula, but another possible solution might be: saving the values of qcount as fixed values (which can not change when another variable changes) and then running the regression on this fixed value. Or using the command: drop if y == . and then running the regression on this fixed value.

I someone knows the command to 'fix' these x values and 'de-link' them from the original formula, the problem might be solved. (In excel this is simply copying the column and only paste the values)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#7

14 May 2014, 06:01

What you seem to be reaching for is a home-grown variety of imputation or interpolation, so it's surprising that you also say that you don't want to do that.

The essence is simple: if you want missing values to be overwritten by non-missing values from somewhere else, that's imputation or interpolation, regardless of whether it's a described flavour of either in some literature.

It's still unclear to me what values are to be copied where and what the rationale is. Being able to copy values from one part of your dataset is not itself sufficient to justify doing it.
Comment
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#8

14 May 2014, 08:04

I think you're asking a very different question from the one Nick is answering, but you are not sufficiently clear. It might help to show your code.

If you are doing gen qx = l.x + l2.x +l3.xyour data will look like this:

Code:

gvkey time y x qx 123 1 . 12 . 123 2 . 16 . 123 3 3 10 . 123 4 . 14 52 123 5 2 10 62

If you then do

Code:

reg y qx

the regression will use case 5. Thus it is not dropping the x that goes into the calculation of qx, if that calculation is done beforehand. It is however dropping cases where qx is missing because the timeseries isn't yet long enough, and it will drop cases where y is missing.
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#9

14 May 2014, 08:31

Maarten,

Let me take a stab at trying to understanding what it is you are trying to do:

You have quarterly y values (say at time t) which you would like to relate to an x which is the sum of the 3 count values at t-1, t-2, and t-3. However, you would also like to relate y to earlier x values for which y is missing, such as count[t-2]+count[t-3]+count[t-4] , count[t-3]+count[t-4]+count[t-5] , and count[t-4]+count[t-5]+count[t-6].

Is this correct so far?

My first question would be whether this is even appropriate because of the overlap between the four x values that you hope to relate to a single y. Instead, you might consider whether regressing y against the lagged counts (l.count l2.count l3.count) might be more appropriate.

If you believe this approach is appropriate, you could regress y against the lagged x values (regress y x l.x l2.x l3.x). This will have the same number of observations as before (one for each non-missing y) but will allow you to relate that y to multiple x's.

A less appealing alternative (as per Nick's concern in post #7) would be to fill in missing y values with the latest non-missing y value. This is not technically difficult but may not be statistically appropriate.

I hope this helps at least illuminate what you are hoping to do. I am not a time-series expert, so once your problem is better understood, perhaps others can recommend the best solution from a statistical standpoint.

Regards,
Joe
Comment
maartenc

Join Date: May 2014

Posts: 10
#10

14 May 2014, 09:50

Thank you all for trying to help and understand me! Brendan, you understood my question correct. However, when I do a test to check if the qx variable kept the same value, I get different results.
Let me explain my generated variables, regression, the test and how I would like to run the regression step by step. Hopefully this will help to understand my problem.

1) Variables
My y variable is SUE and is observed each quarter over 5 years.
My x variable is count and is observed each month over 5 years. (articles published per month)
gen qCount = l.count + l2.count + l3.count

2) Regression
sort gvkey time reg SUE qCount
3) Test: Did qcount include the values I need? (l.count + l2.count + l3.count)
gvkey time y x qx 123 1 . 12 . 123 2 . 16 . 123 3 3 10 . 123 4 . 14 38 123 5 2 10 40 123 6 . 21 34 123 7 4 15 46 123 8 . 14 50 If qx will not change when running the regression, I should get different results when I do the following:
drop if SUE == . reg SUE qCount However both regression show the same results and if I look the the data I see that qcount changed after dropping missing values of SUE.
(The observations included in the regression are 7005 in both regression.)
gvkey time y x qx 123 3 3 10 . 123 5 2 10 . 123 7 4 15 . 123 8 . 14 35
4) How should my regression look like?
I would like to run the regression on non-missing values of y using the qx, defined at the start. Since I want to regress the data on the x variable of the last 3 months, qx on time=8 for example should be qx(8)=x(7)+x(6)=x(5) and not qx(8)=x(7)+x(5)=x(3)

@Joe, I think you almost got me right. I already performed the regression you suggested, but I hope that a sum of the lagged x values will give me better results. In this regression, I get a very low adj Rsquare of -0.0001. When I drop the observations the regression includes only 23 observations instead of 7005.

Hopefully this explanation makes you able to understand my question more clearly

Best, Maarten
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#11

14 May 2014, 10:17

Maarten,

You're starting to get out of my league, but I'm curious about something. You said before that the x values are monthly and that y values are quarterly. Accordingly, I would have expected that the subset with y values would be 25% of the total, not 0.3% (23/7005). How many companies do you have? How many observations per company? Can you explain how there are so few observations with non-missing x and y values?

Regards,
Joe
Comment

Brendan Halpin

Join Date: Mar 2014
Posts: 152

#12

14 May 2014, 10:44

An illustration with similar data (only one quarter per ID):

Code:

. input gvkey time y x

         gvkey       time          y          x
  1. 123 1 . 12
  2. 123 2 . 16
  3. 123 3 3 10
  4. 124 4 . 14
  5. 124 5 . 10
  6. 124 6 2 21
  7. 125 7 . 15
  8. 125 8 . 14
  9. 125 9 1 14
 10. end

.
. bysort gvkey (time): gen qx = x + x[_n-1] + x[_n-2]
(6 missing values generated)

. list, sepby(gvkey)

     +----------------------------+
     | gvkey   time   y    x   qx |
     |----------------------------|
  1. |   123      1   .   12    . |
  2. |   123      2   .   16    . |
  3. |   123      3   3   10   38 |
     |----------------------------|
  4. |   124      4   .   14    . |
  5. |   124      5   .   10    . |
  6. |   124      6   2   21   45 |
     |----------------------------|
  7. |   125      7   .   15    . |
  8. |   125      8   .   14    . |
  9. |   125      9   1   14   43 |
     +----------------------------+

. reg y qx

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(  1,     1) =    0.93
       Model |  .961538462     1  .961538462           Prob > F      =  0.5122
    Residual |  1.03846154     1  1.03846154           R-squared     =  0.4808
-------------+------------------------------           Adj R-squared = -0.0385
       Total |           2     2           1           Root MSE      =   1.019

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          qx |  -.1923077    .199852    -0.96   0.512    -2.731668    2.347053
       _cons |   10.07692   8.414379     1.20   0.443     -96.8379    116.9917
------------------------------------------------------------------------------

Comment

maartenc

Join Date: May 2014

Posts: 10
#13

14 May 2014, 12:03

Thanks again and apologizes for the unclear examples. The format changed after I posted my post. I didn't use the tags cor

Unfortunately, I don't have acces to Stata at the moment, but I will test it again tomorrow.

Brendan, your post is really helpful, but I am curious what the result would be when the company with gvkey 123 in the example below would include more observations. Will the qx in that case only be calculated when y is non-missing?

Code:

.list sepby, (gvkey) +----------------------------+ | gvkey time y x qx | |----------------------------| 1. | 123 1 . 12 . | 2. | 123 2 . 16 . | 3. | 123 3 3 10 38 | 4. | 123 4 . 11 37 | |----------------------------| 5. | 124 5 . 14 . | 6. | 124 5 . 10 . | 7. | 124 7 2 21 45 | |----------------------------| 8. | 125 8 . 15 . | 9. | 125 9 . 14 . | 10.| 125 10 1 14 43 | +----------------------------+

Will observation 4. have a value of 37 of qx or will this be empty?

Furthermore, I can't see from this regression what value of qx was used in the regression. Were it the original values: 38,45 43 or only the values with both non-missing y and x e.g. 10, 21, 14?
If you insert the commands below, will this still give you the same regression results and will qx be the same?

Code:

drop if y == . reg y qx

In my case, qx changed but the regression gave the same results.

I will try to execute the command you used, which includes bysort, hopefully this will help.
In my regression I created the variable with the command

Code:

gen qCount = l.count + l2.count + l3.count

@Joe: I will have a look at this more thoroughly tomorrow.

Best, Maarten

Last edited by maartenc; 14 May 2014, 12:14. Reason: Sorry I didn't manage to get the first code in the correct front.
Comment
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#14

14 May 2014, 14:45

If there are more observations, qx will continue to increment. The first two are missing because the sequence is too short for the lags until element 3.

Yes, the result is the same as if you "drop if missing(y)" before the regression.

l.count and count[_n-1] are just different ways of taking lags.
Comment
maartenc

Join Date: May 2014

Posts: 10
#15

14 May 2014, 15:15

And what happens with qx after the command "drop if missing(y)"?
If this value changes but the regression gives the same result, this implies that Stata does the same when running the regression without dropping these values manually.
The value of qx used in the regression (which is in my case the same as the qx observable after dropping the missing y observations), is the sum of the lagged values of the last 3 observations from the data what is left after dropping these values. However, I want qx to be calculated as the sum of the three lagged values before the command "drop if missing(y)" is used.
Comment

Announcement

Running regression with panel data but missing values of y

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment