  • Regression for each industry in each year in the sample

    Dear All,

    I am using Stata for my Master's thesis. My study sample covers 15 years, and there are 12 industries in the sample.

    I have tried many times to find a Stata command that allows me to run an OLS regression for each industry and year in my data set. I would really appreciate your help with this.

    Thank you


  • #2
    Without an example of your data, the best that can be done for you is to show you some generic untested code based on some assumptions about your data that may or may not be true. I assume your data has a variable called industry, another variable called year, and that you have an outcome variable y that you wish to regress against predictor variables x1 x2 and x3. I should also point out that what you are asking is only possible if there are multiple observations in each industry in each year: if what you have is industry-year panel data, then you will get no results from 1 observation per industry-year combination.

    Code:
    capture program drop myregress
    program define myregress
        regress y x1 x2 x3
        foreach v of varlist x1 x2 x3 {
            gen b_`v' = _b[`v']
            gen se_`v' = _se[`v']
        }
        gen r2 = e(r2)
        gen n_obs = e(N)
        exit
    end
    
    runby myregress, by(industry year)
    Note: to run this you must install -runby- from SSC.
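
    For reference, installing it is a one-time step. A minimal sketch, assuming an internet connection and a Stata version that the package supports:

    Code:
    ssc install runby    // downloads and installs -runby- from SSC
    help runby           // opens the help file to confirm the installation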

    The regression results (R2, N, and the regression coefficients and their standard errors) will now appear in the data set alongside the observations that participated in them.

    Added: if you don't want the results side-by-side with the original data, but just want a table of the results, add the following commands before the -exit- command:

    Code:
    keep industry year b_* se_* r2 n_obs
    keep in 1
    If you need more specific advice, post back and use -dataex- to show an example of your data. (See FAQ #12 if you are not familiar with the -dataex- command.)
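
    Put together, the table-only variant might look like the sketch below. Like the code above, it is untested and assumes variables named industry, year, y, x1, x2, and x3.

    Code:
    capture program drop myregress
    program define myregress
        regress y x1 x2 x3
        foreach v of varlist x1 x2 x3 {
            gen b_`v' = _b[`v']
            gen se_`v' = _se[`v']
        }
        gen r2 = e(r2)
        gen n_obs = e(N)
        // keep one row of results per industry-year instead of every observation
        keep industry year b_* se_* r2 n_obs
        keep in 1
        exit
    end

    runby myregress, by(industry year)
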
    Last edited by Clyde Schechter; 04 Dec 2017, 16:19.



    • #3
      Dear Prof. Schechter

      I hope you are doing fine.
      I have a question about the output of runby. First, thanks very much for the great command.
      As you can see from the output, only 1172 of the 1835 observations were saved. I understand that the deleted observations were excluded during the -runby- process, but is there any way to keep those observations after running -runby-?



      This is my code:
      ----------------------------------------------------------------------------------------------------

      Code:
      capture program drop myregress
      program define myregress
          levelsof FirmID, local(levels)
          foreach l of local levels {
              preserve
              drop if FirmID == `l'
              regress DeltaEPS_P Lag_DeltaEPS_P CRET
              restore
              replace b_Lag_DeltaEPS_P = _b[Lag_DeltaEPS_P] if FirmID == `l'
              replace b_CRET = _b[CRET] if FirmID == `l'
              replace b_cons = _b[_cons] if FirmID == `l'
              replace se_Lag_DeltaEPS_P = _se[Lag_DeltaEPS_P] if FirmID == `l'
              replace se_CRET = _se[CRET] if FirmID == `l'
              replace r2 = e(r2) if FirmID == `l'
              replace n_obs = e(N) if FirmID == `l'
          }
          exit
      end
      
      runby myregress, by(ff12industry year)
      ----------------------------------------------------------------------------------------------------

      Thanks very much indeed



      • #4
        Yes, there is a way, but you shouldn't do it!

        Thirty-seven of your ff12industry year combinations led to an error in the program you are running. So instead of trying to keep the data, you should fix the program so that when you keep the data, it will be correct data. There is no good reason to keep wrong data. So what you need to do is find out what is causing the errors and fix the data or the program so that doesn't happen.

        Looking at your program, my guess would be that in those groups that are leading to errors, there are levels of FirmID where the regression, which is restricted to firms other than that one, cannot be carried out, either because there are no observations at all or because there are too few observations for a regression with two predictors. This is a foreseeable error. So you can do a few things. You can guard the regression (and all the copying of regression results that follows it) with an -if- command that tests for the presence of at least 3 observations (the smallest number for which regression results are possible). Alternatively, you can -capture- the regression and everything that follows it, then check for error codes 2000 or 2001 (which are no observations and insufficient observations, respectively), and exit with an error if the code is anything other than those two. That way, these foreseeable situations, which are not really errors, will be tolerated by your program without an error condition arising. But if there is something else wrong, then you will still get an error condition, it will be reported in the -runby- output, and the corresponding data will, appropriately, be absent from your results.
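
        The -capture- approach might be sketched as follows, using the variable names from #3. This is an untested sketch: it assumes FirmID is numeric and that the result variables (b_CRET and so on) already exist in the data set, and it abbreviates the list of -replace- commands.

        Code:
        capture program drop myregress
        program define myregress
            levelsof FirmID, local(levels)
            foreach l of local levels {
                preserve
                drop if FirmID == `l'
                capture regress DeltaEPS_P Lag_DeltaEPS_P CRET
                local rc = _rc    // keep the return code of -regress-
                restore
                if `rc' == 0 {
                    // regression succeeded: copy its results to the current firm
                    replace b_Lag_DeltaEPS_P = _b[Lag_DeltaEPS_P] if FirmID == `l'
                    replace b_CRET = _b[CRET] if FirmID == `l'
                    replace r2 = e(r2) if FirmID == `l'
                    replace n_obs = e(N) if FirmID == `l'
                }
                else if !inlist(`rc', 2000, 2001) {
                    // anything other than no/insufficient observations is a real error
                    exit `rc'
                }
            }
        end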

        If this is not the issue causing your problem, then you need to re-run the whole thing adding the -verbose- option to the -runby- command. That way you will get output from your program, and you will be able to see the specific error messages you are getting. Then you can troubleshoot those.
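
        With the names used in #3, that call would be:

        Code:
        runby myregress, by(ff12industry year) verbose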



        • #5
          Dear Prof. Schechter,

          I have a similar data set where I want to run the regression of y on x1 x2 x3 for every year-industry combination. The Stata code above seems to work, so thank you very much for that. However, could you explain what Stata is actually doing? I do not really understand what is happening in all the different commands.

          Thank you in advance.



          • #6
            Code:
            // THESE FIRST TWO LINES TELL STATA TO CREATE A PROGRAM CALLED MYREGRESS
            capture program drop myregress
            program define myregress
            //  ALL OF THE INDENTED COMMANDS THAT FOLLOW ARE THE CONTENT OF THAT PROGRAM
                 levelsof FirmID, local(levels) // OBTAIN ALL VALUES OF VARIABLE FirmID
                 foreach l of local levels { // DO THE ENCLOSED FURTHER INDENTED COMMANDS FOR EACH FirmID
                     preserve // SAVE THE DATA TEMPORARILY
                     drop if FirmID == `l' // KEEP ONLY FirmIDs OTHER THAN THE CURRENT ONE
                     regress DeltaEPS_P Lag_DeltaEPS_P CRET // DO THE REGRESSION
                     restore // BRING BACK THE FULL DATA SET
                     // THESE REPLACE COMMANDS ALL SAVE THE COEFFICIENTS AND STANDARD ERRORS,
                     //  SAMPLE SIZE, AND R2 OF THE REGRESSION OUTPUT INTO THE OBSERVATIONS FOR THE CURRENT FirmID
                     replace b_Lag_DeltaEPS_P = _b[Lag_DeltaEPS_P] if FirmID == `l'
                     replace b_CRET = _b[CRET] if FirmID == `l'
                     replace b_cons = _b[_cons] if FirmID == `l'
                     replace se_Lag_DeltaEPS_P = _se[Lag_DeltaEPS_P] if FirmID == `l'
                     replace se_CRET = _se[CRET] if FirmID == `l'
                     replace r2 = e(r2) if FirmID == `l'
                     replace n_obs = e(N) if FirmID == `l'
                 }
                 exit // SELF EXPLANATORY
            end // END OF THE PROGRAM myregress
            
            runby myregress, by(ff12industry year) // SEE EXPLANATION OF -runby- BELOW
            The -runby- command takes the data set that is in memory and sorts it into subsets, each subset being defined by the combination of the values of ff12industry and year. It then "feeds" each subset to Stata as if it were a full data set and causes Stata to run the program myregress on that subset, and then stores the results. After all of the subsets have been processed, all of the stored results are stacked together and appear as a new data set in memory, replacing the data set that was there at the start.
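
            To see this behavior on a small scale, here is a minimal sketch using the auto data set that ships with Stata. The program name demo_reg and the choice of variables are only for illustration and are not part of the code in this thread.

            Code:
            sysuse auto, clear
            capture program drop demo_reg
            program define demo_reg
                regress price weight      // one regression per by-group
                gen b_weight = _b[weight]
                gen n_obs = e(N)
            end

            runby demo_reg, by(foreign)
            // each observation now carries the slope and N from its own group
            tabulate foreign, summarize(b_weight)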

            By the way, this was not originally my code. It is an adaptation of the code posted in #3 by the original poster of this thread. I would have written it somewhat differently. But as you are not comfortable with this code, I do not suggest modifying it.

            A brief summary of what this code does would be:
            For each combination of ff12industry and year, calculate for each firm the coefficients and standard errors of a regression of DeltaEPS_P on Lag_DeltaEPS_P and CRET carried out on all other firms in that industry and year, and save those results in the data set.



            • #7
              Thank you for the quick and elaborate response; I now have a better understanding of the program. However, I only executed the simple version in Stata, which is the following:

              capture program drop myregress
              program define myregress
              regress y x1 x2 x3
              foreach v of varlist x1 x2 x3 {
              gen b_`v' = _b[`v']
              gen se_`v' = _se[`v']
              }
              gen r2 = e(r2)
              gen n_obs = e(N)
              exit
              end

              This one does not contain the -replace- and -preserve-/-drop- commands that appear in the code you explained above. Is the basic code I just posted enough? To give you an idea of what my data set looks like: it is a panel data set with the firm identifier ISIN covering the years 2006-2020. I also have an industry identifier, which is a two-digit SIC code. I want to run a regression for each industry-year combination and then use the residuals as the dependent variable in my subsequent regressions. Can I just run the basic program above and then add: predict VariableName, residuals or should I use the advanced program with -replace- etc.?

              Thank you for your time.
              Last edited by Tess Verschelden; 22 Sep 2021, 13:29.



              • #8
                Well, there is a huge difference between your -myregress- program and the one in #6 and earlier. It is not merely a simpler version. They do different things.

                The one in #6 and earlier looks within each industry-year combination at each individual firm and calculates the regression from all the other firms in that industry-year combination. Your -myregress- just does the regression on the entire set of data for the industry-year and does not produce different results for each firm. The use of -preserve- and -restore-, all those -if- conditions, and -levelsof- are necessary to do what the earlier one does. They are not necessary for a single regression for each industry-year combination.

                So it depends on what you are trying to do.
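
                If a single regression per industry-year is what is wanted, a minimal untested sketch that also saves the residuals could look like this. It assumes the generic names y, x1, x2, x3 from #2 and by-variables called Industry and Year; the program name simple_reg and the variable name resid are made up for illustration.

                Code:
                capture program drop simple_reg
                program define simple_reg
                    regress y x1 x2 x3
                    predict double resid, residuals   // residuals from this industry-year's regression
                end

                runby simple_reg, by(Industry Year)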



                • #9
                  Well, I am not sure which program I need. I am trying to do the following: I want to generate the dependent variable Abnormal_CashFlows, which is a proxy for real earnings management. Here I follow the model of Roychowdhury, S. (2006). He first measures the 'normal' level of cash flows from operating activities (CFO) by estimating the following regression for each industry-year pair:

                  $$ \frac{CFO_{i,t}}{A_{i,t-1}} = \alpha_0 + \alpha_1 \frac{1}{A_{i,t-1}} + \alpha_2 \frac{S_{i,t}}{A_{i,t-1}} + \alpha_3 \frac{\Delta S_{i,t}}{A_{i,t-1}} + \varepsilon_{i,t} $$

                  For every firm-year, abnormal cash flow from operations is the actual CFO minus the 'normal' CFO calculated using the estimated coefficients from the corresponding industry-year model above. In other words, the abnormal cash flow is captured in the error term. As I need the error term of this regression, I need to estimate an industry-year model. Do you have any idea which program is required to achieve this?

                  Thank you.



                  • #10
                    I'm afraid I can't help you on this aspect of it. I am an epidemiologist. My knowledge of finance and economics is pretty much limited to what I glean from posts here on Statalist that touch on those areas. I don't know what Roychowdhury, S. (2006) is--the forum FAQ does advise that if you mention references you should provide complete information. But even if you had provided a complete reference, I suspect it is in a journal that I would not have access to. I suggest you carefully read the methods in that paper and see if it clearly describes whether each industry-year gets a single regression covering all firms, or whether each firm in the industry-year gets its own regression based on all the other firms. If you cannot find that in the paper, then, as it is not really a statistical or Stata question, I think you would be most likely to get help by asking a colleague in your field.



                    • #11
                      I have seen such industry-year regressions, but I don't remember the exact procedure. Like Clyde Schechter, I am not sure whether each industry-year gets a single regression covering all firms, or whether each firm in the industry-year gets its own regression based on all the other firms. One article on such models states:

                      "In an effort to overcome these problems, recent studies have begun to use cross-sectional versions of the models (e.g., Becker et al., 1998; Subramanyam, 1996; DeFond and Jiambalvo, 1994). Under this approach, the first stage regression is estimated separately for each industry-year combination, after which the resulting industry- and time-specific parameter estimates are combined with firm-specific data to generate estimated discretionary accruals. In the event that the sample firm is included in the first-stage regression model, its estimate of discretionary accruals is equal to the corresponding regression residual and consequently the two-stage estimation procedure collapses to a single stage"
                      Source: Detecting Earnings Management Using Cross-Sectional Abnormal Accruals Models. https://citeseerx.ist.psu.edu/viewdo...=rep1&type=pdf

                      I am not sure whether both mean the same or not but I thought it may be of some use




                      • #12
                        I understand what you're saying, and I am still not 100% sure because I cannot find it in the original paper or in related papers. However, from the quote found by lal mohan kumar, I think that each firm in the industry-year gets its own regression based on all the other firms. Now that I know I need the advanced code, I have a Stata-related question, because the code is unfortunately not working properly. I entered the following code:

                        Code:
                        capture program drop myregress
                        program define myregress
                            levelsof ISIN3, local(levels)
                            foreach l of local levels {
                                preserve
                                drop if ISIN3 == `l'
                                regress CFO_scLaggedAssets One_scLaggedAssets Sales_scLaggedAssets DeltaSales_scLaggedAssets
                                restore
                                replace b_One_scLaggedAssets = _b[One_scLaggedAssets] if ISIN3 == `l'
                                replace b_Sales_scLaggedAssets = _b[Sales_scLaggedAssets] if ISIN3 == `l'
                                replace b_DeltaSales_scLaggedAssets = _b[DeltaSales_scLaggedAssets] if ISIN3 == `l'
                                replace b_cons = _b[_cons] if ISIN3 == `l'
                                replace se_One_scLaggedAssets = _se[One_scLaggedAssets] if ISIN3 == `l'
                                replace se_Sales_scLaggedAssets = _se[Sales_scLaggedAssets] if ISIN3 == `l'
                                replace se_DeltaSales_scLaggedAssets = _se[DeltaSales_scLaggedAssets] if ISIN3 == `l'
                                replace r2 = e(r2) if ISIN3 == `l'
                                replace n_obs = e(N) if ISIN3 == `l'
                            }
                            exit
                        end
                        
                        runby myregress, by(Industry Year)
                        The output gives me: number of by-groups = 510, bygroups with errors = 510, observations processed = 34,879 and observations saved = 0. I already looked at your earlier comment in #4 on how to overcome the errors in this code. However, I already dropped all the missing values of all the independent variables and the dependent variable. And thereafter I made sure that every Industry-Year pair has at least 15 observations, with the following code:

                        Code:
                        drop if CFO_scLaggedAssets == .
                        drop if One_scLaggedAssets == .
                        drop if Sales_scLaggedAssets == .
                        drop if DeltaSales_scLaggedAssets == .
                        egen IndustryYear = group(Industry Year)
                        egen count = count(IndustryYear), by(IndustryYear)
                        drop if count < 15
                        Thereafter, I added -verbose- to the -runby- command and got the error "variable b_One_scLaggedAssets not found" for every industry-year pair. I do not understand why this variable cannot be found while the other variables can. Do you have any idea what is causing this error and how to overcome it? I have attached my log file for the code above; at the beginning of the log file I summarize the variable One_scLaggedAssets.

                        Thank you so much for your time.
                        Attached Files



                        • #13
                          Like many others, I do not open or download attachments from people I do not know. Please repost your log file by copy/pasting it into the data editor between code delimiters.

                          Also, for troubleshooting, it is important to have example data to work with. Please be sure to use the -dataex- command to do that.

                          That said, given the particular error message you are encountering, I should point out that the code in -myregress- shown in #12 requires that the variables that will store the regression output need to be created before anything can be -replace-d into them. So I suspect that what you need to do here is, before the -runby- command, prepare the data set with:

                          Code:
                          foreach v of varlist One_scLaggedAssets Sales_scLaggedAssets DeltaSales_scLaggedAssets {
                              gen b_`v' = .
                              gen se_`v' = .
                          }
                          gen b_cons = .
                          gen r2 = .
                          gen n_obs = .
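
                          The underlying rule is that -replace- can only modify a variable that already exists, whereas -gen- creates one. A small self-contained illustration, with a made-up variable name:

                          Code:
                          clear
                          set obs 3
                          capture noisily replace newvar = 1   // fails: newvar does not exist yet
                          display _rc                          // nonzero return code
                          gen newvar = .
                          replace newvar = 1                   // now succeeds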



                          • #14
                            Thank you so much, it worked! The command gives no more errors and all the observations are saved. I have one last question. Can I now just obtain the residuals by adding the command predict VariableName, residuals, or is this done differently when using -runby-?

                            Thank you in advance.



                            • #15
                              It's not as simple as that. To get the residuals you have to dance around a little bit with variables inside -myregress- just as was done with the coefficients and standard errors. So I would revise the code as follows:

                              Code:
                              capture program drop myregress
                              program define myregress
                                  levelsof ISIN3, local(levels)
                                  foreach l of local levels {
                                      preserve
                                      drop if ISIN3 == `l'
                                      regress CFO_scLaggedAssets One_scLaggedAssets Sales_scLaggedAssets DeltaSales_scLaggedAssets
                                      restore
                                      replace b_One_scLaggedAssets = _b[One_scLaggedAssets] if ISIN3 == `l'
                                      replace b_Sales_scLaggedAssets = _b[Sales_scLaggedAssets] if ISIN3 == `l'
                                      replace b_DeltaSales_scLaggedAssets = _b[DeltaSales_scLaggedAssets] if ISIN3 == `l'
                                      replace b_cons = _b[_cons] if ISIN3 == `l'
                                      replace se_One_scLaggedAssets = _se[One_scLaggedAssets] if ISIN3 == `l'
                                      replace se_Sales_scLaggedAssets = _se[Sales_scLaggedAssets] if ISIN3 == `l'
                                      replace se_DeltaSales_scLaggedAssets = _se[DeltaSales_scLaggedAssets] if ISIN3 == `l'
                                      replace r2 = e(r2) if ISIN3 == `l'
                                      replace n_obs = e(N) if ISIN3 == `l'
                                      predict r, resid
                                      replace residual = r if ISIN3 == `l'
                                      drop r
                                  }
                                  exit
                              end
                              
                              foreach v of varlist One_scLaggedAssets Sales_scLaggedAssets DeltaSales_scLaggedAssets {
                                  gen b_`v' = .
                                  gen se_`v' = .
                              }
                              gen b_cons = .
                              gen r2 = .
                              gen n_obs = .
                              gen residual = .
                              
                              runby myregress, by(Industry Year) status
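
                              After the run, a quick sanity check might be the following (an untested sketch; residual is the variable created above):

                              Code:
                              count if missing(residual)   // observations that received no residual
                              summarize residual, detail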
