
  • Storing estimates from regressions by industry and year

    Hello everyone,

    I hope you are all doing well.

    Kindly, I need your help on the following (I found a similar post, but I couldn't customise the code posted to my data):

    I have panel data ranging from 1990 to 2015. I need to loop regressions by year and industry and store the estimates in separate variables. The regression equation is Y = X1 + X2 + X3 + e.

    I am trying the following:

    Code:
    gen B1=.
    gen B2=.
    gen B3=.
    
    
    levelsof INDUSTRY, local(levels)
    
    foreach x of local levels {
    foreach z of numlist 1990/2015 {
    capture reg  Y X1 X2 X3 if INDUSTRY==`x' & YEAR==`z'
    if e(N) >= 10 {
                           replace B1=_b[X1] if e(sample)
                           replace B2=_b[X2] if e(sample)
                           replace B3=_b[X3] if e(sample)
            }
        }
    }
    Unfortunately, Stata is returning an error message: no variables defined

    Thank you for your suggestions.

    Mostafa

  • #2
    There isn't anything obviously wrong with the code you show. The message itself suggests that you did not -use- or otherwise create any data set in memory before you ran the code, or if you did, you ended up somehow dropping everything. I think once you get a data set in memory that contains the variables mentioned in the code it should run more or less correctly.

    That said, there is a flaw in the code that might trip you up once you get it running:

    -capture- is a potentially dangerous command because it gives a pass to any error whatsoever in the command or block of commands it applies to. Generally in this kind of repeated regression loop with coefficients being saved, the use of -capture- is intended to allow you to skip over combinations of INDUSTRY and YEAR that don't exist in the data. The problem is that if there is something else wrong with the regression command, it will also skip over that without telling you what happened. You may or may not ever find out about it. But if you do, it will likely be at some point in the future when you suddenly discover that some later result makes no sense because there are only missing values for some combinations of INDUSTRY and YEAR that shouldn't be missing! It might happen in the middle of a presentation when somebody in the audience notices a problem and calls you out on it!
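
    To make the danger concrete, here is a minimal sketch using the auto data set that ships with Stata (my choice for illustration, not the poster's data). The variable name wieght is deliberately misspelled, and the local macro rc_typo is a name invented for this example. Under -capture- nothing appears on screen; only the saved return code reveals that the regression never ran.

    ```stata
    sysuse auto, clear

    * Deliberate typo (wieght): -capture- suppresses the error message entirely
    capture regress price mpg wieght

    * Save the return code immediately; the next -capture- would overwrite it
    local rc_typo = _rc

    * A nonzero code is the only sign that something went wrong
    display "Return code after the misspelled regression: `rc_typo'"
    ```

    If you never look at the return code, the loop simply carries on as though that combination had no data.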

    So -capture- should be used conservatively, or even avoided, using, instead, workarounds that are targeted to the specific problem you are trying to skip over. In this case there are a few different ways to do this.

    The first approach is to not generate combinations of INDUSTRY and YEAR that don't exist in the data in the first place. A different loop structure would accomplish this:

    Code:
    foreach z of numlist 1990/2015 {
        levelsof INDUSTRY if YEAR == `z' & !missing(Y, X1, X2, X3), local(levels)
        foreach x of local levels {
            regress Y X1 X2 X3 if INDUSTRY == `x' & YEAR == `z'
            if e(N) >= 10 { 
    // etc.
    Another approach is to specifically trap the problem of no observations by testing for it before doing the regression:
    Code:
    foreach x of local levels {
        foreach z of numlist 1990/2015 {
            quietly count if !missing(Y, X1, X2, X3) & INDUSTRY == `x' & YEAR == `z'
            if r(N) >= 10  { // THE MINIMUM NUMBER OF OBSERVATIONS YOU WILL ACCEPT
                regress Y X1 X2 X3 if INDUSTRY == `x' & YEAR == `z'
    // etc.
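
    For anyone who wants to see this count-first pattern run from start to finish, here is a sketch on simulated data. Everything about the data is an assumption made up for illustration (3 industries, 4 years, 50 observations per cell, one cell deliberately made unusable); only the variable names and the threshold of 10 come from the original question.

    ```stata
    clear
    set seed 12345
    set obs 600

    * Simulated panel (illustrative only): 3 industries x 4 years, 50 obs per cell
    generate INDUSTRY = ceil(_n/200)
    generate YEAR     = 1990 + mod(_n, 4)
    generate X1 = rnormal()
    generate X2 = rnormal()
    generate X3 = rnormal()
    generate Y  = 1 + 0.5*X1 - 0.3*X2 + 0.2*X3 + rnormal()

    * Make one cell unusable so that the guard has something to skip
    replace X1 = . if INDUSTRY == 3 & YEAR == 1993

    generate B1 = .
    generate B2 = .
    generate B3 = .

    levelsof INDUSTRY, local(levels)
    foreach x of local levels {
        foreach z of numlist 1990/1993 {
            * Count the usable observations before attempting the regression
            quietly count if !missing(Y, X1, X2, X3) & INDUSTRY == `x' & YEAR == `z'
            if r(N) >= 10 {
                quietly regress Y X1 X2 X3 if INDUSTRY == `x' & YEAR == `z'
                quietly replace B1 = _b[X1] if e(sample)
                quietly replace B2 = _b[X2] if e(sample)
                quietly replace B3 = _b[X3] if e(sample)
            }
        }
    }

    * Show one row per industry-year cell
    egen byte tag = tag(INDUSTRY YEAR)
    list INDUSTRY YEAR B1 B2 B3 if tag, sepby(INDUSTRY)
    ```

    The cell for INDUSTRY 3 in 1993 should keep missing values in B1 through B3, because the count falls below the threshold and the regression is never attempted.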
    A third approach is to constrain the -capture- structure by verifying that the -capture-d error condition is the one you were guarding against and not some other unexpected problem. The condition you want to skip over is when there are no observations available for the regression. The error code for that is 2000. So:

    Code:
    foreach x of local levels {
        foreach z of numlist 1990/2015 {
            capture regress Y X1 X2 X3 if INDUSTRY == `x' & YEAR == `z'
            if c(rc) == 0 { // THE REGRESSION PROCEEDED WITHOUT ERROR
                if e(N) >= 10 {
                    replace B1 = _b[X1] if e(sample)
    // etc.
    // ...
                }
            }
            else if c(rc) != 2000 { // SOME PROBLEM OTHER THAN "no observations"
                display as error "Unexpected problem encountered for INDUSTRY `x' & YEAR `z'"
                display as error "Error code: `c(rc)'"
            }
        }
    }
    With this code, situations where there are no regressable observations with INDUSTRY `x' and YEAR `z' will be skipped over gracefully, but if there is some other problem with the regression command that prevents its execution, Stata will tell you about it with a nice red error message that includes Stata's error code before proceeding to the next combination of industry and year. Then you can go back to your data and figure out why you had a problem and what to do about it.
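
    If you would like to see error code 2000 for yourself before relying on it, here is a small sketch using the auto data set that ships with Stata (my choice for illustration). The condition foreign == 5 matches no observations because foreign is only 0 or 1, and rc_empty is a local macro name invented for this example.

    ```stata
    sysuse auto, clear

    * foreign takes only the values 0 and 1, so this selects no observations
    capture regress price mpg weight if foreign == 5
    local rc_empty = _rc
    display "Return code with no observations: `rc_empty'"

    * Branch on the code in the same way as the loop above
    if `rc_empty' == 2000 {
        display "No observations: safe to skip this combination"
    }
    else if `rc_empty' != 0 {
        display as error "Unexpected error code: `rc_empty'"
    }
    ```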

    Finally, I will point out that there is another flaw in your code that could trip you up. You -capture- the regression. Suppose we have a combination of YEAR and INDUSTRY for which there are no observations (or there is some other problem so that -regress- cannot run successfully). Stata proceeds to the next command, which is -if e(N) >= 10-. But because the regression was not successful, e(N) is undefined, and Stata will see -if >= 10- and halt the program, complaining about a syntax error. Had this happened to you, you would probably be quite puzzled as to why it ran the loop a number of times and then suddenly decided at some point that there is a syntax error. Bugs like this are hard to figure out and are another downside of using -capture- carelessly. Note that none of the three alternatives I have shown you above will have this problem because the -if e(N) >= 10- command either does not exist (the second version, where if r(N) >= 10 substitutes for it), or it is only reached if the regression has run successfully (the first and third).






    • #3
      You might also want to look into the xtmg command written by Markus Eberhardt, which will do this for you if you xtset your data properly.



      • #4
        Dear Clyde: thank you very much for your comprehensive reply and valuable suggestions that always work.

        I tried the codes and I prefer the second one. It's neat and does exactly what I want.

        I really appreciate your help.


        Dear Jesse: thank you for your suggestion. I will look into the help file for the command.



        • #5
          Hello everyone,

          I hope you are all doing well.

          Kindly, I need your help on the following (I found a similar post, but I couldn't customise the code posted to my data):

          I have panel data ranging from 2005 to 2021. I need to loop regressions by year and industry and store the residuals, coefficients, and p-values, t-values, or standard errors in separate variables. The regression equation is Y = a + X1 + X2 + X3 + e.

          I am trying the following code, and it works fine for obtaining the residuals. But I am not sure how to modify it further to get the coefficients and p-values. Any help would be highly appreciated.


          forval y = 2005/2021 {
              forval i = 1/28 {
                  display `y'
                  display `i'
                  capture reg Y X1 X2 X3 if year == `y' & industry == `i', noconstant
                  if c(rc) == 0 {
                      if e(N) >= 15 {
                          predict r if `i' == industry & `y' == year, residuals
                          replace discr = r if `i' == industry & `y' == year
                          drop r
                      }
                  }
                  local `i' = `i' + 1
              }
              local `y' = `y' + 1
          }



          • #6
            #5

            The line

            Code:
            local `i' = `i' + 1
            is usually wrong as people mean

            Code:
            local i = `i' + 1
            but here it should just be deleted as forvalues does all that for you. Similar comment about the next local statement.
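
            A minimal sketch of that point, with a local macro name (visited) made up for the illustration: the loop body never touches i, yet i still takes each value in turn.

            ```stata
            local visited
            forvalues i = 1/3 {
                * No manual increment: forvalues advances i by itself
                local visited `visited' `i'
            }
            display "Values taken by i: `visited'"
            ```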

            Detail: The regression equation should be more like

            Y = a + b1 X1 + b2 X2 + b3 X3

            and (a matter of taste) I would write b0 not a. (Indeed, subscripts would look better).

            Last edited by Nick Cox; 02 Mar 2023, 02:39.



            • #7
              Thanks Nick for pointing that out. I understood that.
              But I am still struggling with how to loop regressions by year and industry and store the residuals, coefficients, and p-values, t-values, or standard errors in separate variables.



              • #8
                That said, for coefficients and P-values (all of them?), you're probably better off learning about statsby, asreg (SSC) or rangestat (SSC).
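
                As a rough sketch of the statsby route only, here is an example on the auto data set that ships with Stata, grouping by foreign because that data set has no industry or year (so the data, the model and the t_* variable names are all assumptions for illustration). With your own data the option would presumably be by(industry year). As far as I know, statsby names the collected results _b_varname and _se_varname; check the variable list after running it.

                ```stata
                sysuse auto, clear

                * One regression per level of foreign. The clear option replaces the data
                * in memory with one row per group holding coefficients and standard errors
                statsby _b _se, by(foreign) clear: regress price mpg weight

                * t statistics computed from the collected results
                generate t_mpg    = _b_mpg / _se_mpg
                generate t_weight = _b_weight / _se_weight

                list foreign _b_mpg _se_mpg t_mpg _b_weight _se_weight t_weight
                ```

                Note that this gives one row per group rather than residuals for each observation, so the residuals would still need -predict- inside a loop like the one in #5.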



                • #9
                  Thanks Nick. I will look into that.
