
  • Rolling regression or loops - missing values

    Dear All

    I have a peculiar issue I wish to share with you, in the hope of receiving some advice.
    I have the following panel dataset (unbalanced), with four variables:

    Industry-code  Year  Industry-sale  Number of firms in industry
    12             2001  34014          5
    12             2002  35402          4
    12             2003  29473          5
    12             2004  .              5
    12             2005  29044          7
    12             2006  31024          7
    12             2007  32209          10
    12             2008  33218          9
    13             2004  5162           5
    13             2005  .
    13             2006  5234           6
    …              …     …              …

    I have to run this regression:

    Industry-sale = a + year + error

    Specifically, I want to run this regression for each year, using data from the previous 5 years. In other words, I want a rolling regression for each industry-year, computed as follows and subject to the following conditions:
    • within each industry, for each year, estimate the regression: industry-sale = a + year + error
    Note: The estimate must be based on data from the previous 5 years. For example, for the year 2008, the regression uses data from 2003, 2004, 2005, 2006, and 2007. The window always spans 5 years and moves forward with the focal year of the regression (i.e., 2002, 2003, etc.)
    • condition A: if the depvar or indepvar is missing in any of the previous 5 years, no estimate is given; that is, Stata should return just a missing value;
    • condition B: if the number of firms in the industry is below 5 (i.e., 4 or fewer) in any of the previous 5 years, no estimate is given; that is, Stata should return just a missing value.
    Once these regressions are calculated for the whole dataset, I wish the standard error of the coefficient “year” to be stored in the dataset. I am aware of the command rolling. For example, the command:

    rolling regress_SE = _se[year], window(5): regress industry-sale year

    does this job. However, this command does not account for missing values. If there is a missing value, it calculates the regression over 4 years, whereas I want Stata not to calculate or store that estimate (condition A above). Likewise, if the number of firms is below 5, the command still runs the regression, whereas I want it not to (condition B above).

    Can anyone help me?
    Thanks





  • #2
    I am not aware of any command that will skip the regression under those conditions. But it is easy enough to simply remove the unwanted results after the fact with a couple of lines of code.

    I note, by the way, that in the example data you give, there actually are no observations for which you would keep the regression results: there is always either a missing value among the 5 years, or a year where the number of firms is below 5. Presumably this is not the case in your full data.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte industry_code int year long industry_sale byte n_firms
    12 2001 34014  5
    12 2002 35402  4
    12 2003 29473  5
    12 2004     .  5
    12 2005 29044  7
    12 2006 31024  7
    12 2007 32209 10
    12 2008 33218  9
    13 2004  5162  5
    13 2005     .  .
    13 2006  5234  6
    end
    
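    * (reg) fits the regression over the previous 5 years; (min) records the smallest firm count in that window
    rangestat (reg) industry_sale year (min) n_firms, by(industry_code) interval(year -5 -1)
    * Set results to missing if the window has fewer than 5 complete observations (condition A)
    * or if any year in the window has fewer than 5 firms (condition B)
    foreach v of varlist reg_*r2 b_* se_* {
        replace `v' = . if reg_nobs < 5
        replace `v' = . if n_firms_min < 5
    }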
    -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer, and is available from SSC.
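
    Since no window in the example data survives both conditions, a case where results are kept may help. The following is a minimal sketch on made-up data: industry 99 and all its values are hypothetical, invented only for illustration. It uses the same -rangestat- call as above and assumes one observation per industry-year. Note that (min) ignores missing values, so a missing n_firms in the window does not by itself trigger condition B.

    ```stata
    * Hypothetical data, invented for illustration only (not from the original post)
    clear
    input byte industry_code int year long industry_sale byte n_firms
    99 2001 100 5
    99 2002 110 6
    99 2003 105 6
    99 2004 120 7
    99 2005 125 8
    99 2006 130 8
    99 2007   . 9
    99 2008 140 9
    end

    * Same approach as above: regress over the previous 5 years, then blank out
    * results from incomplete windows or windows with a year below 5 firms
    rangestat (reg) industry_sale year (min) n_firms, by(industry_code) interval(year -5 -1)
    foreach v of varlist reg_*r2 b_* se_* {
        replace `v' = . if reg_nobs < 5 | n_firms_min < 5
    }

    * 2006 and 2007 each have 5 complete prior years with at least 5 firms
    assert !missing(se_year) if inlist(year, 2006, 2007)
    * The window for 2008 (2003-2007) includes the missing 2007 sale
    assert missing(se_year) if year == 2008

    list year reg_nobs n_firms_min se_year, noobs
    ```

    The standard error of the coefficient on year ends up in se_year; the other reg_*, b_* and se_* variables can be dropped if they are not needed.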

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    When asking for help with code, always show example data. When showing example data, always use -dataex-.




    • #3
      I am not aware of any command that will skip the regression under those conditions.
      Correction. That's not true. -rangerun- could do that, but the code using -rangestat- shown above is much easier.
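
      For readers curious what the -rangerun- route might look like, here is a rough sketch, not tested against the poster's data. It assumes -rangerun- (from SSC, which as far as I know also needs -rangestat- installed) runs the named program once per observation on the observations inside the interval, and returns any new variable the program creates. The program name se_complete_window, the result variable se_year_rr, and the industry 99 data are all hypothetical. Check -help rangerun- before relying on it.

      ```stata
      * Hypothetical data, invented for illustration only
      clear
      input byte industry_code int year long industry_sale byte n_firms
      99 2001 100 5
      99 2002 110 6
      99 2003 105 6
      99 2004 120 7
      99 2005 125 8
      99 2006 130 8
      99 2007   . 9
      99 2008 140 9
      end

      capture program drop se_complete_window
      program define se_complete_window
          * Data in memory here: the observations from the previous 5 years only.
          * Exit without creating a result unless all 5 years are present,
          * none has a missing sale or year, and none has fewer than 5 firms.
          * (A missing n_firms is not flagged, because missing is not < 5.)
          if _N < 5 exit
          quietly count if missing(industry_sale, year) | n_firms < 5
          if r(N) > 0 exit
          quietly regress industry_sale year
          generate double se_year_rr = _se[year]
      end

      rangerun se_complete_window, by(industry_code) interval(year -5 -1) use(industry_sale year n_firms)

      list year se_year_rr, noobs
      ```

      Here the regression is skipped up front rather than discarded afterwards. If no window in the data qualifies (as in the example data in #2), the program never creates se_year_rr, so do not expect the variable to exist in that case.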
