Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to compute moving sums by id in panel data?

    Hi,

    I have a panel data set, where individuals are identified by "id" and time periods are identified by "date". For each id-date observation, there is a numeric variable "var".

    What I want to do is to create a rolling-window variable "ms", which sums var over the prior 10 dates for each id. For some id-date observations, there might be less than 10 dates available - say only 5 since the first date for that id - then ms should be the sum of var over these 5 observations.

    How can I achieve this in a simple way?

  • #2
    It is unclear to me whether you want the running some to be over whatever the 10 preceding dates are (which might include dates going back centuries, for all we know about the data), or just over dates in the preceding 10 days. (Or maybe by dates you actually mean years?) So I'll guess you want the running sum over the preceding ten days. (By which I mean from 1 day before through 10 days before, not current day through 9 days before.) Then it goes like this:

    Code:
    rangestat (sum) ms = var, by(id) interval(date -10 -1)
    You need to install the -rangestat- program, from SSC, written by Robert Picard, Nick Cox, and Roberto Ferrer.

    If this is not what you meant, when posting back be sure to use the -dataex- command to show an example of your actual data and then illustrate what results you would want to see from that example. Suggesting code to solve an unclear problem on imaginary data is challenging!

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      It is unclear to me whether you want the running some to be over whatever the 10 preceding dates are (which might include dates going back centuries, for all we know about the data), or just over dates in the preceding 10 days. (Or maybe by dates you actually mean years?) So I'll guess you want the running sum over the preceding ten days. (By which I mean from 1 day before through 10 days before, not current day through 9 days before.) Then it goes like this:

      Code:
      rangestat (sum) ms = var, by(id) interval(date -10 -1)
      You need to install the -rangestat- program, from SSC, written by Robert Picard, Nick Cox, and Roberto Ferrer.

      If this is not what you meant, when posting back be sure to use the -dataex- command to show an example of your actual data and then illustrate what results you would want to see from that example. Suggesting code to solve an unclear problem on imaginary data is challenging!
      Hi Clyde,

      Thanks for your reply and letting me know about the two commands. The date horizon is monthly, so to be clear, what I want is to sum over -9, -8, -7, ..., -1, 0 months for each observation. Below is an example generated by dataex.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str6 id long date float var
      "012589" 10743 1
      "012589" 10773 1
      "012589" 10804 1
      "012589" 10835 1
      "012589" 10865 1
      "012589" 10896 1
      "012589" 10926 1
      "012589" 10957 1
      "012589" 10988 1
      "012589" 11016 1
      "012589" 11047 1
      "012589" 11077 1
      "012589" 11108 1
      "012589" 11138 1
      "012589" 11169 1
      "012589" 11200 1
      "012589" 11230 1
      "012589" 11261 1
      "012589" 11291 1
      "012589" 11322 1
      "012589" 11353 1
      "012589" 11381 1
      "012589" 11412 1
      "012589" 11442 1
      "012589" 11473 1
      "012589" 12173 1
      "012589" 12387 0
      "012589" 12418 0
      "012589" 12449 0
      "012589" 12477 0
      "012589" 12508 1
      "012589" 12538 1
      "012589" 12569 1
      "012589" 12599 1
      "012589" 12630 1
      "012589" 12661 1
      "012589" 12691 1
      "012589" 12722 1
      "012589" 12752 1
      "012589" 12783 1
      "012589" 12814 1
      "012589" 12842 1
      "012589" 12873 1
      "012589" 12903 1
      "012589" 12934 1
      "012589" 12964 1
      "012589" 12995 1
      "012589" 13026 1
      "012589" 13056 1
      "012589" 13087 1
      "012589" 13117 1
      "012589" 13148 1
      "012589" 13179 1
      "012589" 13208 1
      "012589" 13239 1
      "012589" 13269 1
      "012589" 13300 1
      "012589" 13330 1
      "012589" 13361 1
      "012589" 13392 1
      "012589" 13422 1
      "012589" 13453 1
      "012589" 13483 1
      "012589" 13514 1
      "012589" 13545 1
      "012589" 13573 1
      "012589" 13604 1
      "012589" 13634 1
      "012589" 13665 1
      "012589" 13695 1
      "012589" 13726 1
      "012589" 13757 1
      "012589" 13787 1
      "012589" 13818 1
      "012589" 13848 1
      "012589" 13879 1
      "012589" 13910 1
      "012589" 13938 1
      "012589" 13969 1
      "012589" 13999 1
      "012589" 14030 1
      "012589" 14060 1
      "012589" 14091 1
      "012589" 14122 1
      "012589" 14152 1
      "012589" 14183 1
      "012589" 14213 1
      "012589" 14244 1
      "012589" 14275 1
      "012589" 14303 1
      "012589" 14334 1
      "012589" 14364 1
      "012589" 14395 1
      "012589" 14425 1
      "012589" 14456 1
      "012589" 14487 1
      "012589" 14517 1
      "012589" 14548 1
      "012589" 14578 1
      "012589" 14609 1
      end
      format %d date
      There are two concerns for me. The first is that summation does might exceed the beginning of each id and include var from the previous id. From the code you suggested, I guess "by(id)" command solves this issue. The second is that the months might not necessarily be consecutive as a month might be missing. If that's the case, the 10 months including the current month might go beyond -9 month. Can we also prevent this from happening?

      Comment


      • #4
        What your example now makes clear is that you have monthly data indexed by daily dates at the end of each month. That's just going make life difficult for you until you convert to monthly dates. Then Clyde's code is good.

        Code:
         
         gen mdate = mofd(date) format mdate %tm rangestat (sum) ms = var, by(id) interval(mdate -10 -1)
        The help for rangestat spells out what the by() option does:
        by(varlist) groups observations, so that statistics are generated using only observations within the same group. For example, this option should be specified when you wish calculations to be restricted to given panels or given times for panel or longitudinal data.
        rangestat with the code specified will calculate sums for all values between 1 and 10 months before the current date. If you have missing values they will just be ignored and the sum based on non-missing values alone.

        Comment


        • #5
          Originally posted by Nick Cox View Post
          What your example now makes clear is that you have monthly data indexed by daily dates at the end of each month. That's just going make life difficult for you until you convert to monthly dates. Then Clyde's code is good.

          Code:
          gen mdate = mofd(date) format mdate %tm rangestat (sum) ms = var, by(id) interval(mdate -10 -1)
          The help for rangestat spells out what the by() option does: rangestat with the code specified will calculate sums for all values between 1 and 10 months before the current date. If you have missing values they will just be ignored and the sum based on non-missing values alone.
          Yest, that's exactly my second concern. If a month is missing, then a month outside of the 10-month range will be included in summation. However, my intention was to sum over all non-missing months within the 10-month range.

          Comment


          • #6
            To be clear, what Nick is telling you is that -rangestat- will only include in the summation those observations that fall between 1 and 10 months of the current month. It is not simply the 10 months in the data that happen to precede the current one (and thus might be older than 10 months). It includes only those observations where the value of the month variable is numerically between 1 and 10 less than the value of the current month variable--that might be any number of observations. So -rangestat- is doing exactly what you are looking for.

            Comment


            • #7
              Originally posted by Clyde Schechter View Post
              To be clear, what Nick is telling you is that -rangestat- will only include in the summation those observations that fall between 1 and 10 months of the current month. It is not simply the 10 months in the data that happen to precede the current one (and thus might be older than 10 months). It includes only those observations where the value of the month variable is numerically between 1 and 10 less than the value of the current month variable--that might be any number of observations. So -rangestat- is doing exactly what you are looking for.
              I see. Sorry for my misinterpretation, and thanks again for your clarification!

              Comment


              • #8
                Can I revive this thread, I tried Nick Cox's code but something went awry

                I have a county-level panel dataset where counties are identified by "fips_num" and month-year is identified by "date". For each id-date observation there is a numeric variable "arrests_month".

                I'd like to compute a rolling sum variable, the sum of arrest_month in the preceding 10 months by county. For example, for january 2014, I'd like to add up arrests from march 2013 through december 2013. If there are no observations for a month-year, I'd like it to ignore and sum for the months that are available.

                Here is the code I tried and sample of data, there seems to be something wrong with the way the month is being extracted and the mdate variable.

                Code:
                gen mdate = mofd(date)
                format mdate %tm
                rangestat (sum) arrests_10mo = arrests_month, by(fips_num) interval(mdate -10 -1)
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str27 county_surrounding_area float(fips_num date) int arrests_month float mdate double arrests_10mo
                "Los Angeles County, CA" 6037 657 384 21    .
                "Los Angeles County, CA" 6037 658 343 21    .
                "Los Angeles County, CA" 6037 659 267 21    .
                "Los Angeles County, CA" 6037 660 227 21    .
                "Los Angeles County, CA" 6037 661 316 21    .
                "Los Angeles County, CA" 6037 662 415 21    .
                "Los Angeles County, CA" 6037 663 368 21    .
                "Los Angeles County, CA" 6037 664 257 21    .
                "Los Angeles County, CA" 6037 665 252 21    .
                "Los Angeles County, CA" 6037 666 213 21    .
                "Los Angeles County, CA" 6037 667 287 21    .
                "Los Angeles County, CA" 6037 668 256 21    .
                "Los Angeles County, CA" 6037 669 220 21    .
                "Los Angeles County, CA" 6037 670 211 22 3805
                "Los Angeles County, CA" 6037 671 200 22 3805
                "Los Angeles County, CA" 6037 672 174 22 3805
                "Los Angeles County, CA" 6037 673 186 22 3805
                "Los Angeles County, CA" 6037 674 221 22 3805
                "Los Angeles County, CA" 6037 675 218 22 3805
                "Los Angeles County, CA" 6037 676 228 22 3805
                "Los Angeles County, CA" 6037 677 197 22 3805
                "Los Angeles County, CA" 6037 678 238 22 3805
                "Los Angeles County, CA" 6037 679 246 22 3805
                "Los Angeles County, CA" 6037 680 202 22 3805
                "Los Angeles County, CA" 6037 681 224 22 3805
                "Los Angeles County, CA" 6037 682 198 22 3805
                "Los Angeles County, CA" 6037 683 162 22 3805
                "Los Angeles County, CA" 6037 684 168 22 3805
                "Los Angeles County, CA" 6037 685 267 22 3805
                "Los Angeles County, CA" 6037 686 222 22 3805
                end
                format %tm date
                format %tm mdate
                Last edited by Aresha Martinez; 11 Jan 2023, 22:32. Reason: tagging Nick

                Comment


                • #9
                  #8

                  Thanks for the data example.

                  The code is doing what you told it to do. mdate takes on two distinct values in the data example, 21 and 22. So, for 21 the ten previous values would be 11 to 20, which are not in the data, so the sum is returned as missing. For 22 the ten previous values would be 12 to 21, and as just said only 21 is present in the data, so the sum is returned as 3805. I didn't check that but it looks right, as the typical term is about 300 and there are 13 values.

                  It seems to me that you chose the wrong variable. date looks more like a Stata monthly date to me.

                  Comment


                  • #10
                    Originally posted by Nick Cox View Post
                    #8

                    Thanks for the data example.

                    The code is doing what you told it to do. mdate takes on two distinct values in the data example, 21 and 22. So, for 21 the ten previous values would be 11 to 20, which are not in the data, so the sum is returned as missing. For 22 the ten previous values would be 12 to 21, and as just said only 21 is present in the data, so the sum is returned as 3805. I didn't check that but it looks right, as the typical term is about 300 and there are 13 values.

                    It seems to me that you chose the wrong variable. date looks more like a Stata monthly date to me.
                    Thanks Nick! You were right, the date was year-month (e.g. 2015m10) so the mofd function wasn't working. I used the code below per the advice of this forum and it worked!

                    Code:
                    gen mdate = mofd(dofm(date))
                    format mdate %tm
                    rangestat (sum) arrests_10mo = arrests_month, by(fips_num) interval(mdate -10 -1)

                    Comment


                    • #11
                      Glad you worked it out but mofd(dofm(date)) is a mapping from monthly date to daily date to monthly date. It does no harm but it is, net, an identity transformation.

                      Code:
                      . di mofd(dofm(750))
                      750
                      
                      . di %tm mofd(dofm(750))
                       2022m7
                      As guessed in #9 you could have used mdate all along.


                      Comment

                      Working...
                      X