Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Retrieve information from "moving sum" data

    Hello, I have at hand a dataset that resembles the toy example I'm reporting below. I have data about car accidents in a number of counties that are reported monthly as a "moving sum" -- so to speak -- of the past 12 months, and this data is available for the past 5 years. My aim is to retrieve -- if possible -- the count of accidents for a given county in a given month and year; intuitively I should try some kind of recursive pairwise difference between subsequent annual counts, but I have no idea how to get there.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 start_date str7 end_date byte county_id int tot_accidents
    "2019m1" "2019m12" 1 635
    "2019m2" "2020m1"  1 674
    "2019m3" "2020m2"  1 592
    "2019m4" "2020m3"  1 598
    "2019m5" "2020m4"  1 618
    "2019m1" "2019m12" 2 376
    "2019m2" "2020m1"  2 389
    "2019m3" "2020m2"  2 342
    "2019m4" "2020m3"  2 318
    "2019m5" "2020m4"  2 365
    end
    start_date and end_date refer to the first and last month the sum of the accidents (tot_accidents) refers to. As you can see, the count of the first observation is the sum of accidents in County 1 from January 2019 to December 2019; the second observation is the sum of accidents in County 1 from February 2019 to January 2020, and so on. My aim is to retrieve in some way the monthly figure for each month and county, therefore my target dataset should look like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 month byte county_id str1 monthly_accidents
    "2019m1" 1 "?"
    "2019m2" 1 "?"
    "2019m3" 1 "?"
    "2019m4" 1 "?"
    "2019m5" 1 "?"
    "2019m1" 2 "?"
    "2019m2" 2 "?"
    "2019m3" 2 "?"
    "2019m4" 2 "?"
    "2019m5" 2 "?"
    end
    Can anyone help me out?
    Last edited by Luna Diaz; 26 Feb 2023, 13:03.

  • #2
    Please, can anyone help me out? Is there any way I can achieve this with Stata?

    To be clear, I'm aware I don't have sufficient information to retrieve monthly estimates for the first and last bunch of months (the first 12 and last 12?), but there should be a way to do so for the observations "in the middle", right? Alas, my knowledge of math only goes so far...

    Thanks in advance to anyone who can weigh in.

    Comment


    • #3
      Cross-posted at https://stats.stackexchange.com/ques...al-moving-sums (where it is, I guess, doomed as off-topic).

      I agree that you have a challenging problem and would post a solution if I had one.

      Comment


      • #4
        Originally posted by Nick Cox View Post
        Cross-posted at https://stats.stackexchange.com/ques...al-moving-sums (where it is, I guess, doomed as off-topic).
        Thank you for reporting the cross-post, next time I'll be quicker at doing so myself.

        I admit I am not super-familiar with the posting etiquette over at StackExchange, but I'm not sure why it's an off-topic question there but not here? I often find helpful answers via Google searches there, and I tried using adequate tags for the post... Should I have posted on the Data Science SE? Stats seems more active.

        I agree that you have a challenging problem and would post a solution if I had one.
        Thank you anyway!

        Comment


        • #5
          Your post on Cross Validated is now closed with an explanation why.

          I post from time to time on Cross Validated and also on Stack Overflow. I can't see that your question is more appropriate at either place or that you will get better answers than are likely here.

          Comment


          • #6
            I have a partial solution. I do not believe a complete solution is mathematically possible. The following code, when adapted to your data set, will give correct answers where it can, and missing values where it can't. The code is demonstrated with a toy data set that resembles your data structure (5 years of data and 12 month running totals). The toy data set includes the actual monthly values and, at the end, verifies that the results (new variable wanted), where not missing, are correct.

            Code:
            //  CREATE A TOY DATA SET
            clear*
            set obs 60
            gen int start_month = tm(2018m12) + _n
            format start_month %tm
            expand 2
            by start_month, sort: gen county_id = _n
            isid county_id start_month, sort
            
            set seed 1234
            gen monthly_accidents = runiformint(35, 75)
            
            rangestat (sum) tot_accidents = monthly_accidents, interval(start_month -11 0) ///
                by(county_id)
            //  TOY DATA SET IS NOW IN MEMORY
            
            //  CALCULATE MONTHLY DIFFERENCES IN THE TOTALS
            xtset county_id start_month
            gen delta = D1.tot_accidents
            
            //  EXTRACT THE MONTHLY ACCIDENTS FROM THE SERIES OF DIFFERENCES IN ONE COUNTY
            capture program drop one_county
            program define one_county
                gen wanted = 0
                replace wanted = . in 1/12
                forvalues i = `=_N'(-1)13 {
                    forvalues j = `i'(-12)1 {
                        replace wanted = wanted + delta[`j'] in `i'
                    }
                }
                exit
            end
            
            //  DO IT FOR EACH COUNTY
            runby one_county, by(county_id)
            
            //  VERIFY THAT RESULTS, WHEN NOT MISSING, ARE CORRECT.
            assert wanted == monthly_accidents if !missing(wanted)
            -runby- is written by Robert Picard and me, and is available from SSC.

            Comment


            • #7
              [The following code, when adapted to your data set, will give correct answers where it can, and missing values where it can't.
              Thank you Clyde, this is very much appreciated! I will adapt your solution to my dataset ASAP, and let you know how it turns out!

              Comment


              • #8
                Dear Clyde Schechter, I have tried your code and it's pretty close to what I need to recover the monthly information I plan to use -- and losing the information for the first 12 months is actually not that much of an issue, because my outcome is only available for a narrower timeframe.

                There is an issue though, and it's probably my own fault -- I should have described the dataset I have at hand more thoroughly. As it stands the toy dataset is built in such a way that tot_accidents behaves as a rolling sum (in Jan 2019 I only have the count for Jan 2019, in Feb 2019 the count for Jan 2019 + Feb 2019, and so on). But in reality what I have in my original dataset is not a rolling sum, but something weirder that I would describe as a "moving sum" (for lack of a better term): I have the sum of accidents for the 12-month period that goes from Jan 2019 to Dec 2019, then the sum of accidents for the 12-month period that goes from Feb 2019 to Jan 2020, then the sum of accidents for the 12-month period that goes from Mar 2019 to Feb 2020, and so on.

                Do you believe your solution could be adapted to work on a dataset of this kind?

                Thank you again for your help, it is much appreciated.


                EDIT: disregard the part I put in blue (is there a strike-through in the editor?), I had misunderstood your code because I only glanced at the first few lines of observations. As far as I can tell your toy dataset behaves like mine.

                A question remains, though -- I'm not clear on why it is not possible to recover the information for the first month of a given year (provided it's not the first year of my data, so 2019)? The "moving sums" I have bridge over from a year to another, so building the loop in a way that doesn't rely on the count from 1 to 12 should do the trick, right?
                Last edited by Luna Diaz; 28 Feb 2023, 16:16.

                Comment


                • #9
                  Actually, the toy data set is what you call a moving sum. It only appears to be otherwise if you look at the first 12 observations in each county. That's because the process has to begin somewhere. So in the first 11 months, the "12 month total" only extends back to the start at 2019m1 because there is no data earlier to fill out the year. (Or, equivalently, we could imagine that the monthly number is always 0 before that first observation.) That makes it look like a rolling sum. But if you look starting at 2020m1 in the toy data set, you will se that it is no longer just accumulating: it is now moving. But that is an inherent limitation to my solution method: it assumes that the data goes all the way back to the beginning of data collection. If that is not true, the method will not work. And I can't think of any way to solve it off hand.

                  Let me ponder it for a while and see if I can figure something else out.

                  Comment


                  • #10
                    Yes Clyde Schechter, thank you, it took me a while but I got why the toy dataset behaves like that for the first few observations! If it's helpful, I can probably go back a few more years with my data (it will just take a bit of tedious, manual work on ugly .xlsx files...), so I can start with data from Jan 2017 onwards to recover the monthly figures for Jan 2019 onwards, and then it shouldn't be a problem for your code, right?

                    The only standing issue is being able to somehow recover (see the EDIT I made to my previous post, unfortunately I posted at the same time you were replying) the monthly figures for Jan 2020, Jan 2021, and so on...


                    EDIT: additional information, I had a look at those .xlsx files and can confirm the first dataset is a 12-month long dataset, so there is no "monthly figure" to use as the starting point.
                    Last edited by Luna Diaz; 28 Feb 2023, 17:23. Reason: Additional information

                    Comment

                    Working...
                    X