Hi Statalist,
I have a question about how to appropriately calculate moving averages in panel data. I'm really new to -tsset-, so I figure I am doing something very simply wrong, but can't quite figure out what it is. It likely has to do with the structure of the dataset.
My data describe a cohort study where individuals are tested for COVID-19 on specific dates. Some of those people might test positive (obviously the rest are negative). I'm interested in graphing a 7-day moving average of the test positivity rate. There are multiple observations per day, but only one observation per person--something like this:
participant_id | date_of_test | test_result |
1 | 1Jan2022 | 0 |
2 | 1Jan2022 | 1 |
3 | 1Jan2022 | 0 |
4 | 1Jan2022 | 0 |
5 | 2Jan2022 | 1 |
6 | 2Jan2022 | 1 |
7 | 2Jan2022 | 0 |
Since I am interested in the test positivity rate, I've calculated the number of tests per day and number of positive tests per day:
Code:
bys date_of_test: gen pos_total = sum(test_result)
egen max_pos = max(pos_total), by(date_of_test)
egen max_tests = count(covid_result), by(date_of_test)
gen positivity_rate = max_pos/max_tests
Which gives me something like this:
participant_id | date_of_test | test_result | pos_total | max_pos | max_tests | positivity_rate |
1 | 1Jan2022 | 0 | 0 | 1 | 4 | 0.25 |
2 | 1Jan2022 | 1 | 1 | 1 | 4 | 0.25 |
3 | 1Jan2022 | 0 | 1 | 1 | 4 | 0.25 |
4 | 1Jan2022 | 0 | 1 | 1 | 4 | 0.25 |
5 | 2Jan2022 | 1 | 1 | 2 | 3 | 0.67 |
6 | 2Jan2022 | 1 | 2 | 2 | 3 | 0.67 |
7 | 2Jan2022 | 0 | 2 | 2 | 3 | 0.67 |
Graphing those all by themselves yielded too much variability and a line that was too jagged and hard to understand. So I used -tsset-:
Code:
tsset participant_id date_of_test, daily
And then tried to calculate a 7-day moving average of positivity_rate:
Code:
tssmooth ma ma1 = positivity_rate, window(2 1 4)
And wound up with something identical to positivity_rate:
participant_id | date_of_test | test_result | pos_total | max_pos | max_tests | positivity_rate | ma1 |
1 | 1Jan2022 | 0 | 0 | 1 | 4 | 0.25 | 0.25 |
2 | 1Jan2022 | 1 | 1 | 1 | 4 | 0.25 | 0.25 |
3 | 1Jan2022 | 0 | 1 | 1 | 4 | 0.25 | 0.25 |
4 | 1Jan2022 | 0 | 1 | 1 | 4 | 0.25 | 0.25 |
5 | 2Jan2022 | 1 | 1 | 2 | 3 | 0.67 | 0.67 |
6 | 2Jan2022 | 1 | 2 | 2 | 3 | 0.67 | 0.67 |
7 | 2Jan2022 | 0 | 2 | 2 | 3 | 0.67 | 0.67 |
I was instead expecting values for ma1 that would be a moving average over 7 (unique) days. In the case of 1Jan2022, I had imagined that might be an average of positivity_rate values from 30Dec2021 - 5Jan2022.
Since I am new to -tsset-, I cannot tell if perhaps the reason for this is that I have specified the -tsset- statement incorrectly, or perhaps calculated -ma1- incorrectly. A third distinct possibility is that the system does not like the fact that I have the same -positivity_rate- variable for all rows that have the same day. However, selecting only one of the values per day, using
Code:
bys date_of_test: gen nvals = _n == _N
egen max_pos = max(pos_total), by(date_of_test)
egen max_tests = count(covid_result), by(date_of_test)
gen positivity_rate = max_pos/max_tests if nvals == 1
yielded similar results.
I know this is a long question, but if anyone has experience with tsset and can identify what I'm doing wrong, I'd be super grateful.
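To make the target concrete, here is a minimal sketch of the kind of result I am after. It rests on a guess rather than anything I have confirmed: that with -tsset participant_id date_of_test- each participant becomes their own one-observation panel, so the data would need to be reduced to one row per calendar day and declared as a single daily series instead. The toy data and the names date_str, n_pos, n_tests, and ma7 are made up for illustration.

```stata
* Toy data in the same shape as the example above (one row per person)
clear
input participant_id str9 date_str test_result
1 "1Jan2022" 0
2 "1Jan2022" 1
3 "1Jan2022" 0
4 "1Jan2022" 0
5 "2Jan2022" 1
6 "2Jan2022" 1
7 "2Jan2022" 0
end
gen date_of_test = daily(date_str, "DMY")
format date_of_test %td

* Assumption: collapse to one row per day before declaring the time variable
collapse (sum) n_pos = test_result (count) n_tests = test_result, by(date_of_test)
gen positivity_rate = n_pos / n_tests

* Single daily series (no panel variable), then the same window as in my attempt:
* 2 lagged days, the current day, and 4 leading days
tsset date_of_test, daily
tssmooth ma ma7 = positivity_rate, window(2 1 4)
list, clean noobs
```

With only two days in the toy data the window is mostly empty, so this only shows the shape of the steps, not a meaningful average. Note that -collapse- replaces the dataset in memory, so on real data it would presumably need to be wrapped in -preserve- / -restore- or run on a copy.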