  • Problem in running a loop - does not stop

    Dear all,

    I am currently trying to write a code to estimate the abnormal cash flow for my dataset with 70,608 observations and 30 variables.

    After checking for duplicates and creating a date variable usable with xtset, I ran the commands to calculate the abnormal cash flow, including a loop. The code that I ran is:

    Code:
    gen oancf_ram=oancf/L.at
    gen l_at=L.at
    gen ram1=1/L.at
    gen ram2=sale/L.at
    gen ram3=S.sale/L.at

    Code:
    levelsof obs, local(levels)
    foreach x of local levels {
        gen mark=1 if obs==run
        gen sic_lp=sic_2 if obs==run
        qui summ sic_lp
        replace sic_lp=r(mean) if sic_lp==.
        gen datadate_lp=date if obs==run
        qui summ datadate_lp
        replace datadate_lp=r(mean) if datadate_lp==.
        format datadate_lp %ty
        gen sample=1 if sic_lp==sic_2 & datadate_lp==date & l_at!=. & oancf_ram!=. & ram1!=. & ram2!=. & ram3!=.
        egen sample_sum=sum(sample) if mark!=1
        capture reg oancf_ram ram1-3 if sample==1 & mark!=1 & sample_sum>3
        capture predict u_hat_temp, resid
        capture replace u_hat_ram1=u_hat_temp if obs==run & u_hat_ram1==.
        drop mark sic_lp datadate_lp sample sample_sum
        capture drop u_hat_temp
        replace runn=runn+1
    }

    levelsof obs, local(levels) shows all 70,608 observations. However, the first iteration of the loop runs quickly, and then it keeps repeating this:

    (70,607 missing values generated)
    (70,607 missing values generated)
    (70,607 real changes made)
    (70,607 missing values generated)
    (70,607 real changes made)
    (69,793 missing values generated)
    (1 missing value generated)

    Only the 69,793 figure sometimes changes.

    I assume that the mistake is in the line "foreach x of local levels {",
    but I cannot work out how to fix it properly.

    I hope one of you can help me. Thank you in advance.

    Hidde

  • #2
    Your code is puzzling. What is inside the loop is exactly the same regardless of the local macro x as you never refer to that macro inside the loop. That said, I am at a loss to know why you don't get exactly the same messages again and again.

    Comment


    • #3
      This is a little difficult to understand. What I can make of it is that you're running a set of operations including a regression and saving residuals on a large number of levels.
      What is unclear is what you are surprised about. That Stata keeps on working? You realize you are asking Stata to execute all that code for 70,608 different values, yes?
      Explain a bit more about your goal and what it is that you're surprised about. Also explain what obs, run, and runn are. Are you sure you want to execute your regression on each individual observation?

      Like Nick, I am surprised that Stata would not just repeat "(70,607 missing values generated)" 70,608 times. I am also puzzled about what the utility of the loop is.

      Comment


      • #4
        Thank you for your quick responses. I'm quite new to Stata and I have to estimate real earnings management. To do so, I am following this blog post:

        https://robsonglasscock.wordpress.co...-manipulation/

        I am more or less copying the code, so I think I have simply copied parts of his general commands where I should have written specific commands related to my dataset.

        Comment


        • #5
          It's not as crazy as it seemed. My bad.

          I said

          What is inside the loop is exactly the same regardless of the local macro x as you never refer to that macro inside the loop.
          That's still true. But the loop doesn't do exactly the same things again and again, as the variable runn is updated.

          The original code is needed to understand more of what's happening.

          It's poor style, or so I assert, to put a counter in a variable. The variable runn is initialised at 1 and incremented by 1 each time around the loop. Experienced Stata programmers would always use a macro for that purpose.

          Also, obs is just the observation number. Even if a loop over observations is needed (I can't say) I would always just write that as a forval loop from 1 to the number of observations. Putting the observation number in a variable and looping over its distinct levels using levelsof isn't going to be clearer or faster or better in any sense.
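          A minimal sketch of that alternative (illustrative only, not a drop-in replacement for the code in #1):

          Code:
          * loop directly over observation numbers; the loop index `i'
          * replaces both the obs variable and a counter variable
          forvalues i = 1/`=_N' {
              * ... per-observation work, using `i' wherever the
              * original code compares obs against the counter ...
          }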

          None of that helps the question much. Essentially the code sets up a loop over observations and that's always relatively slow.

          I have no idea myself what is calculated in this field. My instinct is that this code could be speeded up enormously but I am not the person to do it. Very likely Robson himself knows much more Stata than he did when he posted this.

          PS


          Code:
           ram1-3
          looks wrong. That's not a legal varlist. Because you have capture on that the error won't halt the code, but my guess is that none of those regressions will run and results depending on them will all be missing. Indeed Robson had

          Code:
          ram1-ram3
          so that's a simple typo, but one hard for a beginner to find.
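          The intended regression line, then, would presumably be

          Code:
          capture reg oancf_ram ram1-ram3 if sample==1 & mark!=1 & sample_sum>3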
          Last edited by Nick Cox; 15 Feb 2019, 06:43.

          Comment


          • #6
            Hmm okay. I will add your comments and try to look if I can find a way to speed the code up. Thank you for your help!

            Comment


            • #7
              Ram1-3 is not part of the code that I posted. That is a modification Hidde van Lent made himself. Also, the loop Hidde wrote has "run" near the start and "runn" at the bottom. I used "runn" in all parts of my loops. That may have something to do with his problem.

              I realize the loop is inefficient, but I can explain the reason for the runn variable and the iterations approach. Real activities manipulation (aka real earnings management) and discretionary accruals are supposed to be estimated over the entire population of firms covered by Compustat. Researchers typically have some subset of the population that will be included in their sample, but the estimates should be derived from firms from the entire population.

              Concretely, let's say I have 2,000 firm-years that make up the observations that will ultimately be used in a research paper. Let's also say the time period runs from 2000-2010, and the entire population of firm-years in Compustat over this period is 100,000. One of the variables I am interested in, as either a control variable or a dependent variable, is either real activities manipulation (RAM) or discretionary accruals (DA).

              One approach would be to just use the same 2,000 firm-years to calculate RAM or DA. But the problem here is that the values of DA or RAM will be wrong since they are only based on a subset of firms. RAM and DA should be based on the entire population of firms covered by Compustat and not just whatever firms happen to be in my sample. All firms in the same sic and year should be used, which will be pulled from the 100,000, not just the 2,000. Additionally, "firm i" (each firm in the sample of 2,000) should be excluded when the estimates of DA and RAM are calculated.

              The way I approached this was to keep all 100,000 firm-years but condition the loops to only estimate DA or RAM for 2,000 firms I really care about obtaining DA and RAM estimates for and not the other 98,000 in the dataset. The mark, obs, and runn variables were a way for me to constrain the loops to iterate over the 2,000 firm-years I needed DA or RAM for while also excluding "firm i". I also wrote two blog posts about different ways to calculate DA and RAM using a "top down" and a "bottom up" approach. Note that the industry-time period regressions approach to estimate DA may also be used to estimate RAM.

              Nick Cox and Clyde Schechter improved the discretionary accruals code on a Statalist post here:

              and I updated the discretionary accruals blog post with this:

              " Clyde Schechter gave a solution that cuts down the 4 minutes processing time on this dataset to 28 seconds. He makes three improvements:

              1) use of -levelsof- instead of the outdated -vallist-

              2) instead of running one loop for year and one loop for industry, he combines these into one variable and runs one loop. This makes sense: it’s sic AND year that we care about. This never occurred to me. Doing this one step cuts the 4 minute processing time down to 2 minutes and 34 seconds.

              3) Clyde embedded an “if statement” before any of the regressions are run. My code would needlessly run the same regression over and over again as it iterated through all of the observations. This if statement, if combo[`j'] == `k' { …., means regressions will only run when the combined two-digit sic and year (based on the grouping variable, “combo”) for the observation in the dataset equals the current two-digit sic and year (based on the grouping variable, “combo”) per the outer loop. When this is not the case, the code only lists “combo” and the observation number. Doing this step cuts the data processing time from 2 minutes 34 seconds to… 28 seconds.

              Hats off to both Nick and Clyde for the improvements. Here is a slight modification to Clyde’s substantial improvement to what I posted: .... "
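              In outline, the combined-group approach reads roughly like this (a sketch with illustrative variable names, not the code actually posted, and it omits the exclusion of "firm i" described above):

              Code:
              * one grouping variable for two-digit SIC by fiscal year
              egen combo = group(sic_2 fyear)
              gen u_hat_ram1 = .
              levelsof combo, local(groups)
              foreach k of local groups {
                  * each regression runs once per sic-year group,
                  * not once per observation
                  capture reg oancf_ram ram1-ram3 if combo == `k'
                  capture predict u_hat_temp if combo == `k', resid
                  capture replace u_hat_ram1 = u_hat_temp if u_hat_ram1 == .
                  capture drop u_hat_temp
              }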

              I was worried that people might read my blog post about RAM and run my original, inefficient loops- so I made sure to point them to the updated discretionary accruals blog post. The blog post Hidde van Lent is referring to contains the following update to point people to the modified code that Nick and Clyde wrote instead of my original code:

              " Update as of 1/17/2014: I wrote a second block of code for industry-time regressions after posting this that avoids the homemade loop shown here. The new post is about discretionary accruals, but the same approach may be used for RAM. You can see that blog post here:

              https://robsonglasscock.wordpress.co...cruals-update/ "

              and the link above includes links and references to Clyde and Nick improving the inefficient code I originally wrote and comments about everything that happened along the way.
              Last edited by Robson Glasscock; 15 Feb 2019, 11:44.

              Comment


              • #8
                Robson Glasscock Thanks for the detailed clarification. I'd forgotten my previous comments on your work.

                Comment


                • #9
                  Nick Cox You bet. Thanks to you and Clyde Schechter for all the people you both help on the forum. I'm still embarrassed that my approaches were inefficient, but my heart was in the right place with posting them. I also decided to leave the original posts up with narratives hoping that the revisions might be informative to others.

                  Comment
