Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Compute variance over a time window with repeated time values within panel

    Hello everyone,

    I have been thinking about the issue below for a few hours now, I tried several things, but I started to give up.

    My problem is conceptually simple: I have a variable, word_count, that measures the number of characters of newspaper headlines per day. There are several newspapers issued per day and therefore several headlines. That is, each word count takes a separate row, which creates repeated time values within panel. Then, there is some corporate event taking place on random days, and is captured using the dummy variable corporate_event. I would like to calculate the variance of the word count (word_count) over a window of 5 days (-2,-2,0,+1+2) around the corporate event (i.e., when corporate_event=1).

    Below is a sample of how the data looks like:

    company_id date corporate_event word_count variance
    1690 15-Jul-14 0 31 .
    1690 15-Jul-14 0 27 .
    1690 20-Jul-14 0 36 402.0751
    1690 20-Jul-14 0 20 402.0751
    1690 20-Jul-14 0 40 402.0751
    1690 21-Jul-14 0 43 402.0751
    1690 21-Jul-14 0 14 402.0751
    1690 21-Jul-14 0 25 402.0751
    1690 21-Jul-14 0 64 402.0751
    1690 21-Jul-14 0 0 402.0751
    1690 22-Jul-14 1 0 402.0751
    1690 22-Jul-14 0 16 402.0751
    1690 22-Jul-14 0 16 402.0751
    1690 22-Jul-14 0 22 402.0751
    1690 22-Jul-14 0 33 402.0751
    1690 22-Jul-14 0 36 402.0751
    1690 22-Jul-14 0 20 402.0751
    1690 23-Jul-14 0 40 402.0751
    1690 23-Jul-14 0 43 402.0751
    1690 23-Jul-14 0 17 402.0751
    1690 23-Jul-14 0 43 402.0751
    1690 23-Jul-14 0 60 402.0751
    1690 23-Jul-14 0 20 402.0751
    1690 24-Jul-14 0 26 402.0751
    1690 24-Jul-14 0 92 402.0751
    1690 24-Jul-14 0 44 402.0751
    1690 24-Jul-14 0 27 402.0751
    1690 24-Jul-14 0 48 402.0751
    1690 25-Jul-14 0 35 .
    1690 25-Jul-14 0 32 .
    1690 25-Jul-14 0 45 .

    In the above example, I have computed the variance using Excel just to show what I am trying to do in Stata.

    Thank you very much for your time and possible suggestions.

    Mostafa

  • #2
    I have some ideas about this, but the question is unclear in a few respects.

    First, if your real data set includes more than one company_id, do you want to do this separately for each company_id?
    Second, I see that you are looking to take the value of the variance on the date of the corporate event and spread it to each date within 2 days before or after it. But what happens if there are two corporate events that are within 3 or fewer days of each other? How do you resolve this conflict for the dates which are within 2 days of both of them?

    Added: In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
    Last edited by Clyde Schechter; 26 Jun 2023, 21:17.

    Comment


    • #3
      Hi Clyde,

      Thank you for your questions.

      The answer to the first question: Yes, I would like to do that for each company_id separately. I did not mention this initially because I thought that I will customize any suggested code to fit my need (I might also run the code for each company-month or company-year).

      The answer to the second question: if there are two corporate events within 3 or fewer days from each other, then I would like to compute the variance of word_count starting -2 days from the first corporate event date to +2 days from the second corporate event date.

      I will greatly appreciate it if you can provide a suggestion for the case where there are no corporate events close to each other (I suppose the code will be simpler in this case), and another suggestion where there might be overlapping corporate events.

      Thank you very much.
      Mostafa

      Comment


      • #4
        The following code works whether there are events within 3 days of each other or not:
        Code:
        //  IDENTIFY WINDOWS
        rangestat (max) inrange = corporate_event, by(company_id) interval(date -2 2)
        
        //  CALCULATE VARIANCE IN INRANGE WINDOWS
        by company_id (date), sort: egen range_num = sum(inrange != inrange[_n-1])
        by company_id range_num (date), sort: egen variance = sd(word_count) if inrange
        replace variance = variance^2
        -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

        Note: This does not produce the numerical result you show. I have checked the variance calculation independently and verified that what this gives is correct. I do not know where your 402.0751 comes from.

        Comment


        • #5
          Hi Clyde,

          First and foremost, thanks very much for the code you provided. It works fine with the data sample I have provided. I will try it with the full dataset and let you know if I run into some issue (e.g., with two overlapping events).

          As for my variance computation, I made a mistake in Excel because I first computed the variance given an initial word_count (the variance numbers were copied and pasted as text), then changed a couple of numbers in word_count (to keep the post simple), but I mistakenly did not update the computation of the variance afterwards. I realized this mistake in the morning [along with a typo in the initial post: (-2 -1 0 +1 +2) instead of (-2 -2 0 +1 +2)], but it was too late to edit. I'm sorry about any inconvenience.

          Thank you very much for the help.
          Mostafa

          Comment


          • #6
            Hi again Clyde,

            Before you provided me with your shortcut code above (thankfully), I was trying a few things that I ended up using for a simple task that is not directly related to my initial question in the thread.

            The code below is customized to do a simple averaging (and manipulation). Specifically, for each company (company_id) in a given year (year), there is a pre-computed variable called singular_var. The code is supposed to calculate the sum of singular_var for the current year (t0) in addition to two lagged years (t-1 and t-2), and then store the value in a new variable called triple_var. I also want the code to do the following: when a company has less that three consecutive values in years t0, t-1, and t-2, I would like to take the average of the available observations (either two or one observations) and multiply it by 3 (so that all observations have an equal weight). The issue is that the code is running for long hours and it has never finished (several trials, including overnights). I tried it on a small panel dataset (5 companies, 10 years) and it worked just fine. My question is whether the code can be more efficient.

            Code:
            gen triple_variable=.
            levelsof company_id, local(companies)
            foreach c of local companies {
                levelsof year if company_id == `c', local(year)
                foreach y of local year {
                    local start_year = `y' - 2
                    local end_year = `y'
                    qui sum singular_var if company_id== `c' & inrange(year, `start_year', `end_year'), meanonly
                    local mean = r(mean)
                    replace triple_var= `mean'*3 if company_id== `c' & inrange(year, `start_year', `end_year')
                }
            }
            Thank you very much for your invaluable help.
            Mostafa

            Comment


            • #7
              This should be much faster:
              Code:
              rangestat (mean) singular_var, by(company) interval(year -2 0)
              gen triple_var = singular_var * 3
              In general, looping over the levels of a variable is slow in Stata. The reason is that the -if- conditions have to be evaluated in the entire data set and that takes time. When the data set is large, it can be painfully slow. While sometimes there is no choice, when you are thinking of doing this, first see if it can be done with -by:-, -runby- (by Robert Picard and me, from SSC), or -rangestat- or -rangerun- (by Robert Picard, available from SSC) instead.

              Comment

              Working...
              X