Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Running regression with panel data but missing values of y

    I have a question about my regression on monthly data of 500 companies which I defined as panel data sorted by gvkey and date. From this data I would like to run a regression on variable x, where x is the sum of three lags of the variable count. For each month I have a variable count, but the variable y is only available 4 times per year (quaterly).

    The x var (qcount) is created by the following formula:
    gen qcount = l.count + l2.count + l3.count

    From some tests I assume that Stata excludes all observations with a missing value of x. This is indeed what I want, but what Stata also does, is excluding all observations of x if y has a missing value. As a result, qcount will not include all the lags I asked for.
    reg y qcount

    Example of my data:
    gvkey time y x
    123 1 . 12
    123 2 . 16
    123 3 3 10
    123 4 . 14

    It seems to look like Stata drops the observations at time 1,2,4 and therefore excludes the x variables as well.
    Does anyone have a solution to exclude the missing y variables while not excluding the x variables in this regression?

    Thanks in advance

    Best, Maarten
    Last edited by maartenc; 14 May 2014, 02:59.

  • #2
    Why are you surprised at this? Stata can't do anything here with missing values, which includes lagged values that are missing.

    Whether there is scope to fill in your missings with imputation (highly fashionable but overrated) or interpolation (centuries-old and underrated) is a different question.
    Last edited by Nick Cox; 14 May 2014, 04:21.

    Comment


    • #3
      I am not surprised and understand that this is the most convenient way for many people. However in my case it does not work out unfortunately and I don't know if there is another method to solve this problem. Hopefully someone can help me.

      Comment


      • #4
        Convenience is not the issue; what exactly do you expect that Stata might do instead given missing values?

        Comment


        • #5
          First let's make sure we are talking about the same thing(s) here. When you say

          Stata drops the observations at time 1,2,4 and therefore excludes the x variables as well.
          the fist part of this is true - Stata drops the observations. The second part, however, might be misleading, because Stat does not drop variables. The x values are used for any observation that has both non-missing x and non-missing y.

          You then ask about

          a solution to exclude the missing y variables while not excluding the x variables in this regression?
          I assume by variables you mean values (or observation with missing values on a particular variable). Nick partly answered this question. You might think about imputing the missing values using MI, interpolation (yes, now I might be blamed for sloppy terminology as interpolation might not be regarded as imputation), or any other method.

          There is not a way to relate (x) values to missing (y) values, and I wonder how you figure this should be possible? Maybe we do not get, what you are trying to do, and you might want to add more information and elaborate on what exactly you mean by

          in my case it does not work out
          Best
          Daniel

          Comment


          • #6
            Thanks for the replies.

            Imputation or interpolation is not really what I am looking for since I do not want to fill these missing y values.

            Hopefully I can make my question more clear:

            1) With x variables I mean observations, sorry for this. As you mention Stata does indeed the following: The x values are used for any observation that has both non-missing x and non-missing y. However I want to use the x values for any observation that has non-missing x, but also for missing y.

            2) I want to include all observations of x even when the y value of this observation is missing. The reason for this is that I want to use the cumulative lagged value of x (gen qcount = l.count + l2.count + l3.count)

            I hoped that it was possible to add an option in the regression formula, but another possible solution might be: saving the values of qcount as fixed values (which can not change when another variable changes) and then running the regression on this fixed value. Or using the command: drop if y == . and then running the regression on this fixed value.

            I someone knows the command to 'fix' these x values and 'de-link' them from the original formula, the problem might be solved. (In excel this is simply copying the column and only paste the values)




            Comment


            • #7
              What you seem to be reaching for is a home-grown variety of imputation or interpolation, so it's surprising that you also say that you don't want to do that.

              The essence is simple: if you want missing values to be overwritten by non-missing values from somewhere else, that's imputation or interpolation, regardless of whether it's a described flavour of either in some literature.

              It's still unclear to me what values are to be copied where and what the rationale is. Being able to copy values from one part of your dataset is not itself sufficient to justify doing it.

              Comment


              • #8
                I think you're asking a very different question from the one Nick is answering, but you are not sufficiently clear. It might help to show your code.

                If you are doing gen qx = l.x + l2.x +l3.xyour data will look like this:
                Code:
                gvkey time y x qx
                123 1 . 12 .
                123 2 . 16 .
                123 3 3 10 .
                123 4 . 14 52
                123 5 2 10 62
                If you then do
                Code:
                reg y qx
                the regression will use case 5. Thus it is not dropping the x that goes into the calculation of qx, if that calculation is done beforehand. It is however dropping cases where qx is missing because the timeseries isn't yet long enough, and it will drop cases where y is missing.

                Comment


                • #9
                  Maarten,

                  Let me take a stab at trying to understanding what it is you are trying to do:

                  You have quarterly y values (say at time t) which you would like to relate to an x which is the sum of the 3 count values at t-1, t-2, and t-3. However, you would also like to relate y to earlier x values for which y is missing, such as count[t-2]+count[t-3]+count[t-4] , count[t-3]+count[t-4]+count[t-5] , and count[t-4]+count[t-5]+count[t-6].

                  Is this correct so far?

                  My first question would be whether this is even appropriate because of the overlap between the four x values that you hope to relate to a single y. Instead, you might consider whether regressing y against the lagged counts (l.count l2.count l3.count) might be more appropriate.

                  If you believe this approach is appropriate, you could regress y against the lagged x values (regress y x l.x l2.x l3.x). This will have the same number of observations as before (one for each non-missing y) but will allow you to relate that y to multiple x's.

                  A less appealing alternative (as per Nick's concern in post #7) would be to fill in missing y values with the latest non-missing y value. This is not technically difficult but may not be statistically appropriate.

                  I hope this helps at least illuminate what you are hoping to do. I am not a time-series expert, so once your problem is better understood, perhaps others can recommend the best solution from a statistical standpoint.

                  Regards,
                  Joe

                  Comment


                  • #10
                    Thank you all for trying to help and understand me! Brendan, you understood my question correct. However, when I do a test to check if the qx variable kept the same value, I get different results.
                    Let me explain my generated variables, regression, the test and how I would like to run the regression step by step. Hopefully this will help to understand my problem.

                    1) Variables
                    My y variable is SUE and is observed each quarter over 5 years.
                    My x variable is count and is observed each month over 5 years. (articles published per month)
                    gen qCount = l.count + l2.count + l3.count

                    2) Regression
                    sort gvkey time reg SUE qCount
                    3) Test: Did qcount include the values I need? (l.count + l2.count + l3.count)
                    gvkey time y x qx 123 1 . 12 . 123 2 . 16 . 123 3 3 10 . 123 4 . 14 38 123 5 2 10 40 123 6 . 21 34 123 7 4 15 46 123 8 . 14 50 If qx will not change when running the regression, I should get different results when I do the following:
                    drop if SUE == . reg SUE qCount However both regression show the same results and if I look the the data I see that qcount changed after dropping missing values of SUE.
                    (The observations included in the regression are 7005 in both regression.)
                    gvkey time y x qx 123 3 3 10 . 123 5 2 10 . 123 7 4 15 . 123 8 . 14 35
                    4) How should my regression look like?
                    I would like to run the regression on non-missing values of y using the qx, defined at the start. Since I want to regress the data on the x variable of the last 3 months, qx on time=8 for example should be qx(8)=x(7)+x(6)=x(5) and not qx(8)=x(7)+x(5)=x(3)

                    @Joe, I think you almost got me right. I already performed the regression you suggested, but I hope that a sum of the lagged x values will give me better results. In this regression, I get a very low adj Rsquare of -0.0001. When I drop the observations the regression includes only 23 observations instead of 7005.

                    Hopefully this explanation makes you able to understand my question more clearly

                    Best, Maarten



                    Comment


                    • #11
                      Maarten,

                      You're starting to get out of my league, but I'm curious about something. You said before that the x values are monthly and that y values are quarterly. Accordingly, I would have expected that the subset with y values would be 25% of the total, not 0.3% (23/7005). How many companies do you have? How many observations per company? Can you explain how there are so few observations with non-missing x and y values?

                      Regards,
                      Joe

                      Comment


                      • #12
                        An illustration with similar data (only one quarter per ID):
                        Code:
                        . input gvkey time y x
                        
                                 gvkey       time          y          x
                          1. 123 1 . 12
                          2. 123 2 . 16
                          3. 123 3 3 10
                          4. 124 4 . 14
                          5. 124 5 . 10
                          6. 124 6 2 21
                          7. 125 7 . 15
                          8. 125 8 . 14
                          9. 125 9 1 14
                         10. end
                        
                        .
                        . bysort gvkey (time): gen qx = x + x[_n-1] + x[_n-2]
                        (6 missing values generated)
                        
                        . list, sepby(gvkey)
                        
                             +----------------------------+
                             | gvkey   time   y    x   qx |
                             |----------------------------|
                          1. |   123      1   .   12    . |
                          2. |   123      2   .   16    . |
                          3. |   123      3   3   10   38 |
                             |----------------------------|
                          4. |   124      4   .   14    . |
                          5. |   124      5   .   10    . |
                          6. |   124      6   2   21   45 |
                             |----------------------------|
                          7. |   125      7   .   15    . |
                          8. |   125      8   .   14    . |
                          9. |   125      9   1   14   43 |
                             +----------------------------+
                        
                        . reg y qx
                        
                              Source |       SS       df       MS              Number of obs =       3
                        -------------+------------------------------           F(  1,     1) =    0.93
                               Model |  .961538462     1  .961538462           Prob > F      =  0.5122
                            Residual |  1.03846154     1  1.03846154           R-squared     =  0.4808
                        -------------+------------------------------           Adj R-squared = -0.0385
                               Total |           2     2           1           Root MSE      =   1.019
                        
                        ------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                  qx |  -.1923077    .199852    -0.96   0.512    -2.731668    2.347053
                               _cons |   10.07692   8.414379     1.20   0.443     -96.8379    116.9917
                        ------------------------------------------------------------------------------

                        Comment


                        • #13
                          Thanks again and apologizes for the unclear examples. The format changed after I posted my post. I didn't use the tags cor

                          Unfortunately, I don't have acces to Stata at the moment, but I will test it again tomorrow.

                          Brendan, your post is really helpful, but I am curious what the result would be when the company with gvkey 123 in the example below would include more observations. Will the qx in that case only be calculated when y is non-missing?
                          Code:
                           .list sepby, (gvkey)  ​     +----------------------------+      | gvkey   time   y    x   qx |      |----------------------------|   1. |   123      1   .   12    . |   2. |   123      2   .   16    . |   3. |   123      3   3   10   38 | ​  4. |   123      4   .   11   37 |      |----------------------------|   5. |   124      5   .   14    . |   6. |   124      5   .   10    . |   7. |   124      7   2   21   45 |      |----------------------------|   8. |   125      8   .   15    . |   9. |   125      9   .   14    . |   10.|   125      10  1   14   43 |      +----------------------------+
                          Will observation 4. have a value of 37 of qx or will this be empty?

                          Furthermore, I can't see from this regression what value of qx was used in the regression. Were it the original values: 38,45 43 or only the values with both non-missing y and x e.g. 10, 21, 14?
                          If you insert the commands below, will this still give you the same regression results and will qx be the same?
                          Code:
                          drop if y == . reg y qx
                          In my case, qx changed but the regression gave the same results.

                          I will try to execute the command you used, which includes bysort, hopefully this will help.
                          In my regression I created the variable with the command
                          Code:
                          gen qCount = l.count + l2.count + l3.count
                          @Joe: I will have a look at this more thoroughly tomorrow.

                          Best, Maarten
                          Last edited by maartenc; 14 May 2014, 12:14. Reason: Sorry I didn't manage to get the first code in the correct front.

                          Comment


                          • #14
                            If there are more observations, qx will continue to increment. The first two are missing because the sequence is too short for the lags until element 3.

                            Yes, the result is the same as if you "drop if missing(y)" before the regression.

                            l.count and count[_n-1] are just different ways of taking lags.

                            Comment


                            • #15
                              And what happens with qx after the command "drop if missing(y)"?
                              If this value changes but the regression gives the same result, this implies that Stata does the same when running the regression without dropping these values manually.
                              The value of qx used in the regression (which is in my case the same as the qx observable after dropping the missing y observations), is the sum of the lagged values of the last 3 observations from the data what is left after dropping these values. However, I want qx to be calculated as the sum of the three lagged values before the command "drop if missing(y)" is used.

                              Comment

                              Working...
                              X