Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with moving averages and tssmooth after tsset

    Hi Statalist,

    I have a question about how to appropriately calculate moving averages in panel data. I'm really new to -tsset-, so I figure I am doing something very simply wrong, but can't quite figure out what it is. It likely has to do with the structure of the dataset.

    My data describe a cohort study where individuals are tested for COVID-19 on specific dates. Some of those people might test positive (obviously the rest are negative). I'm interested in graphing a 7-day moving average of the test positivity rate. There are multiple observations per day, but only one observation per person--something like this:
    participant_id date_of_test test_result
    1 1Jan2022 0
    2 1Jan2022 1
    3 1Jan2022 0
    4 1Jan2022 0
    5 2Jan2022 1
    6 2Jan2022 1
    7 2Jan2022 0
    Since I am interested in the test positivity rate, I've calculated the number of tests per day and number of positive tests per day:
    Code:
    bys date_of_test: gen pos_total = sum(test_result)
    egen max_pos = max(pos_total), by(date_of_test)
    egen max_tests = count(covid_result), by(date_of_test)
    gen positivity_rate = max_pos/max_tests
    Which gives me something like this:
    participant_id date_of_test test_result pos_total max_pos max_tests positivity_rate
    1 1Jan2022 0 0 1 4 0.25
    2 1Jan2022 1 1 1 4 0.25
    3 1Jan2022 0 1 1 4 0.25
    4 1Jan2022 0 1 1 4 0.25
    5 2Jan2022 1 1 2 3 0.67
    6 2Jan2022 1 2 2 3 0.67
    7 2Jan2022 0 2 2 3 0.67
    Graphing those all by themselves yielded too much variability and a line that was too jagged and hard to understand. So I used -tsset-:
    Code:
    tsset participant_id date_of_test, daily
    And then tried to calculate a 7-day moving average of positivity_rate:
    Code:
    tssmooth ma ma1 = positivity_rate, window(2 1 4)
    And wound up with something identical to positivity_rate:
    participant_id date_of_test test_result pos_total max_pos max_tests positivity_rate ma1
    1 1Jan2022 0 0 1 4 0.25 0.25
    2 1Jan2022 1 1 1 4 0.25 0.25
    3 1Jan2022 0 1 1 4 0.25 0.25
    4 1Jan2022 0 1 1 4 0.25 0.25
    5 2Jan2022 1 1 2 3 0.67 0.67
    6 2Jan2022 1 2 2 3 0.67 0.67
    7 2Jan2022 0 2 2 3 0.67 0.67
    I was instead expecting values for ma1 that would be a moving average over 7 (unique) days. In the case of 1Jan2022, I had imagined that might be an average of positivity_rate values from 30Dec2021 - 5Jan2022.

    Since I am new to -tsset-, I cannot tell if perhaps the reason for this is that I have specified the -tsset- statement incorrectly, or perhaps calculated -ma1- incorrectly. A third distinct possibility is that the system does not like the fact that I have the same -positivity_rate- variable for all rows that have the same day. However, selecting only one of the values per day, using
    Code:
    bys date_of_test: gen nvals = _n == _N
    egen max_pos = max(pos_total), by(date_of_test)
    egen max_tests = count(covid_result), by(date_of_test)
    gen positivity_rate = max_pos/max_tests if nvals == 1
    yielded similar results.

    I know this is a long question, but if anyone has experience with tsset and can identify what I'm doing wrong, I'd be super grateful.


  • #2
    What you are showing is not a time series, and it is not suitable for use with -tssmooth-. (I'm actually surprised that -tssmooth- even gave you any results; I would have expected an error message instead.) What you need to do is convert the data to a time series, which means a single observation for each date. So, before the point where you did your -tsset- command you could
    Code:
    collapse (first) positivity_rate, by(date_of_test)
    tsset date_of_test
    tssmooth ma ma1 = positivity_rate, window(2 1 4)
    and you will get what you want.

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    Comment


    • #3
      Thanks so much Clyde, this is really helpful. Is it a necessary condition of tsset (and other commands like xtset or stset) that data need to have a single observation for each date before setting as time-series?

      Thanks also for the tip about -dataex-; I didn't know about this command but will be sure to use it in the future. I appreciate it!

      Comment


      • #4
        It appears that you want to calculate period averages. You can use the ceiling and floor functions to do this. To create a time variable that starts at one with each unit being a day, subtract the minimum value of date from each date and add 1.

        Code:
        qui sum date_of_test
        gen time= date_of_test - `r(min)' + 1
        If the dates are not consecutive but you want to impose that they be consecutive:

        Code:
        egen time= group(date_of_test)
        Then you can bin as follows. Below, assume that we want 3 day averages:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(id time value)
        1 1 10
        1 2 10
        1 3 30
        1 4 10
        1 5 10
        1 6 20
        2 1 30
        2 2 10
        2 3  0
        2 4 10
        2 5 15
        2 6 10
        3 1  0
        3 2 10
        3 3 10
        3 4 10
        3 5  0
        3 6 30
        end
        
        gen period= ceil(time/3)
        bys id period: egen wanted= mean(value)
        Res.:

        Code:
        . l, sepby(id period)
        
             +---------------------------------------+
             | id   time   value   period     wanted |
             |---------------------------------------|
          1. |  1      1      10        1   16.66667 |
          2. |  1      2      10        1   16.66667 |
          3. |  1      3      30        1   16.66667 |
             |---------------------------------------|
          4. |  1      4      10        2   13.33333 |
          5. |  1      5      10        2   13.33333 |
          6. |  1      6      20        2   13.33333 |
             |---------------------------------------|
          7. |  2      1      30        1   13.33333 |
          8. |  2      2      10        1   13.33333 |
          9. |  2      3       0        1   13.33333 |
             |---------------------------------------|
         10. |  2      4      10        2   11.66667 |
         11. |  2      5      15        2   11.66667 |
         12. |  2      6      10        2   11.66667 |
             |---------------------------------------|
         13. |  3      1       0        1   6.666667 |
         14. |  3      2      10        1   6.666667 |
         15. |  3      3      10        1   6.666667 |
             |---------------------------------------|
         16. |  3      4      10        2   13.33333 |
         17. |  3      5       0        2   13.33333 |
         18. |  3      6      30        2   13.33333 |
             +---------------------------------------+
        You could also just collapse the data after binning. For a review of the binning technique, see https://journals.sagepub.com/doi/pdf...867X1801800311.

        ADDED: Crossed with #2 and #3.
        Last edited by Andrew Musau; 03 Oct 2023, 10:38.

        Comment


        • #5
          Re #3. No -tsset- allows you to specify both a grouping variable (like patient ID in your data) and a time variable, and you can have one observation for each combination of the grouping and time variable. There are many commands (for panels and cross-sectional data) that work with this kind of data. But -tssmooth -ma- is designed to work with a single time series--i.e. no grouping variable, and just one-observation per time period. Most of the other time-series commands check that before proceeding and will simply halt with an error message if you violate that.

          I'll also assume that by now you have read Andrew Musau's response in #4 where he offers a different solution. And it will also produce different results, because it is calculating something different. So you need to understand the difference and decide whether his approach or mine is suitable for your purposes. My response in #2 calculates the test positivity rate for each day and then calculates an equally-weighted moving average of the daily test positivity rates. Andrew's response in #4 stays with the uncollapsed data and calculates a test-volume-weighted moving average. That is, in his calculation, a day that has more tests done will contribute more to the end result. In my calculation, all days are counted equally even though they may have different numbers of tests. There is another difference. His approach calculates averages for consecutive non-overlapping groups of days. Mine calculates averages for overlapping groups of days.

          Comment


          • #6
            Code:
            tssmooth ma
            is fine with panel data. I don't think this is clear from the help but it is explicit in the manual entry. But be that as it is, I concur that it doesn't appear to be what you (Maria) want until you collapse (or otherwise reduce) the data to a single time series.

            I would add my bias that if moving averages are worth calculating at all, the simplest case of uniform weights is not the best choice. Binomial weights or exponential often work much better.

            That said, I'd expect substantial time-of-week and holiday effects in this kind of data, so it may be that you need to model all that before you can get useful results.

            Comment


            • #7
              I stand corrected with regard to -tssmooth- and panel data. Nick has it right. But, indeed, its treatment of it is not what O.P. is looking for. In fact, in her example data, each patient appears an only one day, so each panel is a single observation. I can imagine that in real life a patient might have tests done on multiple days, but probably not on a large number of consecutive days (at least for most commonly used lab tests). I do think she is looking for a smooth of the daily (aggregated over patients) averages.

              Comment


              • #8
                Clyde, you're correct that i am looking for smoothed daily averages (over patients, not per-patient). And Nick, you were also correct that I wanted to use 'collapse' to make the data into a single time-series. I'll definitely explore binomial or exponential weights for this, and really appreciate the suggestion! And finally, Andrew, thank you so much for teaching me some new coding techniques. I think my needs might be satisfied with 'collapse' in this circumstance, but definitely helpful to see other options that I may need in the future.

                Comment

                Working...
                X