Rolling when start and end dates vary by id

Inna Petrunyk

Join Date: Oct 2014

Posts: 31
#1


23 Aug 2016, 06:28

Dear statalists,
I have a panel with mothers (id), date (time), and daily pollution measures (poll) between the date of the child's conception and the date of the child's birth. The aim is to study the effect of the mother's exposure to pollution while pregnant (weekly means) on birth outcome. The focus is on the effect of exposure in the 1st, 2nd, ... week of pregnancy, so backward calculation is needed. Which Stata command is appropriate in this case (rolling, or a loop with forval)?
I appreciate any help.
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#2

23 Aug 2016, 09:27

I think you need to say more about your approach. On the one hand it sounds like you are planning to do a separate regression for each mother (which is the only way I can make sense out of thinking about -rolling- or a loop), but each birth will have only a single outcome--so the dependent variable would have no variation in any of those regressions and you would have no information about effects of anything. Am I missing something?

My first instincts for this would be either to calculate summary measures of the pollution over salient periods of time (maybe each trimester, or some shorter periods of time if you want to be more fine grained, but not so fine-grained as each day or week), and then aggregate the data to one observation per mother with those summary measures as predictor variables in a single analysis that incorporates all mothers. Or, to do a multi-level model. But without knowing more specifics, I can't advise beyond that.
Inna Petrunyk

Join Date: Oct 2014

Posts: 31
#3

24 Aug 2016, 03:08

Thanks Clyde, your first intuition is correct.
But in order to be more precise, I should give some more information about the original datasets I start from. I have two separate datasets. One is a cross section with mothers and the corresponding birth outcome (dataset M). The relevant information here is motherid, region of birth, birth outcome, dateofbirth, dateofconception and gestational age (how many days a mother was pregnant). The other (dataset P) is a panel with region id and the corresponding daily (date) pollution measures (pollution). My idea was to

1) generate a sequence of dates between the date of conception and date of birth for each mother in order to get a panel
local start=date(dateofconception, "DMY")
local end=date(dateofbirth, "DMY")
local ob=`end' - `start' + 1
by motherid: set obs `ob'
by motherid: egen date=seq(), from(`start')to(`end')
format %td date

2) merge dataset M with dataset P by region and date, in order to get the pollution measures for each mother in the time period between date of conception and date of birth

3) get for each mother weekly means of the pollution measures for the number of weeks of gestational age
tsset motherid date
rolling mean_poll=r(mean), window(7) stepsize(7): sum pollution

4) for each mother, number the weeks (week) in order to know which gestational week the weekly pollution mean mean_poll refers to, rather than having only the start and end variables that Stata generates while rolling;

5) reshape wide mean_poll, i(motherid) j(week) in order to get cross sections and estimate the effect of mean_poll in the first, second ... gestational week on birth outcome.

Is this procedure correct? Here the challenge is the dataset design. Thanks for your help.
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#4

24 Aug 2016, 09:43

I think I have a better understanding of what you are trying to do. I have some doubts that I'll go into below. But first, some comments on the code.

Your code under 1) tries to use local macros where they won't work, and misunderstands how -set obs- works. What you need for 1) is more like this:

Code:

gen start = date(dateofconception, "DMY")
gen end = date(dateofbirth, "DMY")
gen int n_days = end - start + 1
expand n_days
by motherid, sort: gen date = start + _n - 1
format date start end %td
drop n_days

2) seems correct.
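A minimal sketch of the merge in 2), under stated assumptions: the file name datasetP.dta and the key name region are hypothetical, dataset P is assumed to have exactly one observation per region and date, and its date variable is assumed to be a numeric Stata daily date like the one created in 1).

```stata
* Sketch only: "datasetP.dta" and the key name "region" are assumed names.
* Each mother-day in memory is matched to the single region-day record in P.
merge m:1 region date using "datasetP.dta", keep(master match) keepusing(pollution)

* Mother-days with no pollution reading in P show up as _merge == 1
tabulate _merge
drop _merge
```

If the date in P is stored as a string, it would first have to be converted with date() in the same way as the conception and birth dates.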

3) Using -rolling- to do this seems like overkill to me. I would do it this way:

Code:

by motherid (date), sort: gen int week_num = ceil(_n/7)
by motherid week_num, sort: egen mean_pollution = mean(pollution)
by motherid week_num: keep if _n == 1

Note: You could do the essence of this with -collapse-, but you would have to make some provision for all the variables that do not occur in this code but need to be kept around. So I think doing it this way is, in the end, simpler.
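For comparison, a rough sketch of the -collapse- route mentioned in the note, used in place of the code in 3). The names birth_outcome and region are placeholders for whichever variables from M have to survive; every such variable must be listed explicitly, which is the extra provision the note refers to.

```stata
* Sketch only: birth_outcome and region are hypothetical names for the
* mother-level variables from M that need to be carried along.
by motherid (date), sort: gen int week_num = ceil(_n/7)

* One observation per mother and gestational week; (first) carries the
* mother-level variables through, since they are constant within mother.
collapse (mean) mean_pollution = pollution (first) birth_outcome region, ///
    by(motherid week_num)
```

Any variable not named in the -collapse- command is dropped, so this version gets longer as the list of variables to keep grows.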

At this point your data set contains the original data from M, along with week numbers and the corresponding averages of the pollution data, in long layout.

After this, we part company on data design. I don't know exactly what your analysis plan is, but in most circumstances it is easier to proceed with the long layout that the above delivers. Reshaping to wide is likely to complicate your life. That said, if your plan is to include each week's pollution as a separate predictor in a model and your goal is to try to identify which weeks have the strongest association with birth outcomes, then -reshape wide- is necessary for your approach.
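If the wide layout is wanted, a minimal sketch of that step could look like the following. It assumes the data are in the long layout produced in 3), with one observation per mother and week, and that date and pollution are the only remaining variables that vary within mother.

```stata
* Sketch only: -reshape wide- requires every variable other than the one
* being reshaped to be constant within motherid.
capture drop _merge        // left over from the merge in 2), if still present
drop date pollution        // these still differ from week to week
reshape wide mean_pollution, i(motherid) j(week_num)

* The result has mean_pollution1, mean_pollution2, ... with one row per mother
describe mean_pollution*
```

Mothers with shorter gestations will have missing values in the later week variables, and the last week of each pregnancy may average fewer than seven days.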

But let me critique that goal. First, the week numbers are actually some blend of reality and fantasy. We never really know when a gestation starts. We have various ways of estimating that, by menstrual dates or by ultrasound, but those still leave a fair margin of error of a few weeks. That kind of noise in the data means that we are unlikely to be able to distinguish effects of week 1 vs week 2 or week 3 because of this measurement error. And if the results did seem to support such a distinction, I, for one, wouldn't trust them. So if the goal is to assess the effects of pollution at different time periods during pregnancy, I would be inclined to aggregate up to a longer time period than a week. I'm not sure just how long a period is enough to overcome this limitation, but my hunch is that it's something like a month.
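Aggregating to a longer period only needs a different divisor in the code from 3). The sketch below uses 28-day blocks counted from the estimated conception date. The block length of 28 and the name period_num are illustrative assumptions, not something fixed in this thread. It would be run on the daily data in place of the weekly version.

```stata
* Sketch only: 28-day blocks are an assumed stand-in for "something like a month".
by motherid (date), sort: gen int period_num = ceil(_n/28)
by motherid period_num, sort: egen mean_pollution = mean(pollution)
by motherid period_num: keep if _n == 1
```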
Inna Petrunyk

Join Date: Oct 2014

Posts: 31
#5

24 Aug 2016, 13:24

I think I can aggregate the data to a longer time period; with the right code it's easy now. In the literature trimesters are generally used, but that is too aggregated a level for my purpose. Your suggestions were very helpful, thanks a lot.
Best wishes