Calculating 9 months pollution exposure

Muhammad Ramzan

Join Date: Jan 2015

Posts: 173
#1


16 Jan 2020, 21:43

Hi, I am working on the topic of the impact of air pollution on Child Health, combining the data from the Demographic Health Survey (DHS) and NASA satellite data imagery. I have used the location from DHS data to calculate the mean PM2.5 for each month and each cluster.

I have to construct a trimester pollution exposure by calculating mean PM2.5 for three month periods preceding month m childbirth. I have to construct a nine-month pollution exposure by calculating mean PM2.5 for nine-month periods preceding month m of childbirth.

How can I calculate the trimester pollution exposure and the nine-month pollution exposure?
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

16 Jan 2020, 22:43

When asking for help with code, it is almost always necessary to provide example data, because the code is likely to differ depending on details of the data itself and how it is organized. So please use the -dataex- program and post back with example data. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

In addition, please clarify what you mean by "preceding month m childbirth." Childbirth is, perhaps not an instant, but usually a matter of just a few hours, or at worst a couple of days. So I cannot imagine what "month m childbirth" means. Please enlighten me. Also indicate where it is found in your example data.

Finally, because I am an epidemiologist and I also have some background in environmental medicine, I know what PM2.5 is. But this is a multi-disciplinary, international forum, and I am confident that most Forum members are not familiar with this term. It is always best to avoid jargon here: always explain your questions in terms that anybody with a college education and a minimal statistics background would understand.

Muhammad Ramzan

Join Date: Jan 2015
Posts: 173
#3

02 Feb 2020, 21:26

Hi, I am working on the topic of the impact of air pollution on Child Health, combining the data from the Demographic Health Survey (DHS) and NASA satellite data imagery.
Child health and mother's health characteristics come from the DHS data, whereas the air pollution data come from the NASA satellite imagery.
I have used the location from the DHS data to calculate the mean air pollution for each month and each cluster. I have the air pollution level for each month from March 2003 to July 2018. I have an Excel file for each month containing the air pollution data for each cluster.

I have to construct a trimester pollution exposure by calculating mean air pollution for three-month periods preceding month m childbirth. I have to construct a nine-month pollution exposure by calculating mean air pollution for nine-month periods preceding month m of childbirth.

How can I calculate the trimester pollution exposure and the nine-month pollution exposure?

b1 is the month of childbirth and b2 is the year of childbirth. So, I have to have the average pollution level before the month of child birth. Suppose the child is born in June 2005, so I need the average air pollution for the month of March 2005, April 2005 and May 2005.
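
To make the windows concrete, here is a small sketch using Stata monthly dates. The local macro name birth is only illustrative. For a birth in June 2005, the three-month window runs from March 2005 through May 2005, and the nine-month window starts in September 2004.

```stata
* Monthly date of birth: June 2005 (b2 = 2005, b1 = 6)
local birth = ym(2005, 6)

* Three-month window: birth month minus 3 through birth month minus 1
display %tm (`birth' - 3)   // 2005m3
display %tm (`birth' - 1)   // 2005m5

* Nine-month window starts at birth month minus 9
display %tm (`birth' - 9)   // 2004m9
```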

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(v001 v002) byte v009 int v010 byte(v012 v025 v131 v133 v136 v151 v701 v702) int(v704 v716) byte(v730 bord b1) int b2 byte(b4 b5 b8) float(agedays_death nm im cm)
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 8 10 1994 1 1 12 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 7 11 1993 1 0  . 0 1 1 1
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 6  9 1988 2 1 18 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 5  8 1986 2 1 20 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 4  4 1985 2 1 21 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 3  6 1984 1 0  . 0 1 1 1
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 2  6 1983 1 0  . 0 1 1 1
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 1  5 1982 2 1 24 . 0 0 0
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 4  9 2001 2 1  5 . 0 0 0
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 3 12 1992 1 0  . 5 1 1 1
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 2 12 1979 1 1 27 . 0 0 0
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 1  3 1978 1 1 28 . 0 0 0
1 27  5 1971 35 1 2 10 6 1 2 5 39 67 41 2  2 2000 1 1  7 . 0 0 0
1 27  5 1971 35 1 2 10 6 1 2 5 39 67 41 1  6 1996 2 1 10 . 0 0 0
1 37 10 1970 36 1 2 10 5 1 3 6 39 67 38 3  7 2005 2 1  1 . 0 0 0
1 37 10 1970 36 1 2 10 5 1 3 6 39 67 38 2  1 1999 1 1  8 . 0 0 0
1 37 10 1970 36 1 2 10 5 1 3 6 39 67 38 1  6 1995 1 1 11 . 0 0 0
1 47  3 1977 29 1 2  0 6 1 0 . 36 67 42 4 11 2004 1 1  2 . 0 0 0
1 47  3 1977 29 1 2  0 6 1 0 . 36 67 42 3  6 2000 1 1  6 . 0 0 0
1 47  3 1977 29 1 2  0 6 1 0 . 36 67 42 2  9 1994 2 1 12 . 0 0 0
end
label values v025 LABE
label def LABE 1 "urban", modify
label values v131 v131
label def v131 2 "punjabi", modify
label values v133 v133
label values v151 LABL
label values b4 LABL
label def LABL 1 "male", modify
label def LABL 2 "female", modify
label values v701 v701
label def v701 0 "no education", modify
label def v701 2 "secondary", modify
label def v701 3 "higher", modify
label values v702 LABAV
label values v704 v704
label def v704 11 "accountants", modify
label def v704 13 "teachers (all levels)", modify
label def v704 36 "transport conductors", modify
label def v704 39 "clerical and related workers nec", modify
label values v716 v716
label def v716 13 "teachers (all levels)", modify
label def v716 67 "unemployed", modify
label values v730 v730
label values b5 LABN
label def LABN 0 "no", modify
label def LABN 1 "yes", modify

Air pollution data for the month of March 2000

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int DHSCLUST double ZonalSt_sh
 1   .349314004182816
 2 .34931400418281555
 3 .34931400418281555
 4 .34931400418281555
 5  .2954939901828766
 6 .34931400418281555
 7  .3211260139942169
 8  .5356400012969971
 9  .5356400012969971
10  .5356400012969971
end

Example data of air pollution for the month of April 2000.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int DHSCLUST double ZonalSt_sh
 1 .34931400418281555
 2 .34931400418281555
 3 .34931400418281555
 4 .34931400418281555
 5  .2954939901828766
 6 .34931400418281555
 7  .3211260139942169
 8  .5356400012969971
 9  .5356400012969971
10  .5356400012969971
end

thanks


Clyde Schechter

Join Date: Apr 2014
Posts: 30118
#4

02 Feb 2020, 22:36

With this data organization, you are many steps away from being able to do what you ask.

First, you need to build up a single data set that contains the air pollution data for each location in each month. It appears that you can do that by appending the individual monthly data sets that you have all together, but you must add a variable to it showing the month-year. Within that data set, it is not hard to calculate averages over 3 and 9 month preceding windows.

Second, you need to combine your month and year of birth variables into a single month-year variable. And, crucially, the variable DHSCLUST needs to be in this data set as well. Looking at the example data, I'm guessing that the variable v009 is, in fact, this variable. So in the illustrative code below, I rename it accordingly: the variable must have the same name in both data sets. If v009 isn't the DHSCLUST variable, then you need to rename whichever variable there does indicate the DHSCLUST. If there is no such variable, you have to get it: without the DHSCLUST it is impossible to properly match up the two kinds of data.

So, the code is going to look something like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(v001 v002) byte v009 int v010 byte(v012 v025 v131 v133 v136 v151 v701 v702) int(v704 v716) byte(v730 bord b1) int b2 byte(b4 b5 b8) float(agedays_death nm im cm)
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 8 10 1994 1 1 12 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 7 11 1993 1 0  . 0 1 1 1
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 6  9 1988 2 1 18 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 5  8 1986 2 1 20 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 4  4 1985 2 1 21 . 0 0 0
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 3  6 1984 1 0  . 0 1 1 1
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 2  6 1983 1 0  . 0 1 1 1
1  7  8 1962 44 1 2  5 7 1 3 4 11 67 49 1  5 1982 2 1 24 . 0 0 0
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 4  9 2001 2 1  5 . 0 0 0
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 3 12 1992 1 0  . 5 1 1 1
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 2 12 1979 1 1 27 . 0 0 0
1 17  1 1959 48 1 2 16 2 2 3 4 13 13  . 1  3 1978 1 1 28 . 0 0 0
1 27  5 1971 35 1 2 10 6 1 2 5 39 67 41 2  2 2000 1 1  7 . 0 0 0
1 27  5 1971 35 1 2 10 6 1 2 5 39 67 41 1  6 1996 2 1 10 . 0 0 0
1 37 10 1970 36 1 2 10 5 1 3 6 39 67 38 3  7 2005 2 1  1 . 0 0 0
1 37 10 1970 36 1 2 10 5 1 3 6 39 67 38 2  1 1999 1 1  8 . 0 0 0
1 37 10 1970 36 1 2 10 5 1 3 6 39 67 38 1  6 1995 1 1 11 . 0 0 0
1 47  3 1977 29 1 2  0 6 1 0 . 36 67 42 4 11 2004 1 1  2 . 0 0 0
1 47  3 1977 29 1 2  0 6 1 0 . 36 67 42 3  6 2000 1 1  6 . 0 0 0
1 47  3 1977 29 1 2  0 6 1 0 . 36 67 42 2  9 1994 2 1 12 . 0 0 0
end
label values v025 LABE
label def LABE 1 "urban", modify
label values v131 v131
label def v131 2 "punjabi", modify
label values v133 v133
label values v151 LABL
label values b4 LABL
label def LABL 1 "male", modify
label def LABL 2 "female", modify
label values v701 v701
label def v701 0 "no education", modify
label def v701 2 "secondary", modify
label def v701 3 "higher", modify
label values v702 LABAV
label values v704 v704
label def v704 11 "accountants", modify
label def v704 13 "teachers (all levels)", modify
label def v704 36 "transport conductors", modify
label def v704 39 "clerical and related workers nec", modify
label values v716 v716
label def v716 13 "teachers (all levels)", modify
label def v716 67 "unemployed", modify
label values v730 v730
label values b5 LABN
label def LABN 0 "no", modify
label def LABN 1 "yes", modify
tempfile birth_data
save `birth_data'

* Example generated by -dataex-. To install: ssc install dataex
clear
input int DHSCLUST double ZonalSt_sh
 1   .349314004182816
 2 .34931400418281555
 3 .34931400418281555
 4 .34931400418281555
 5  .2954939901828766
 6 .34931400418281555
 7  .3211260139942169
 8  .5356400012969971
 9  .5356400012969971
10  .5356400012969971
end
tempfile 2000m3 // MARCH 2000
save `2000m3'

* Example generated by -dataex-. To install: ssc install dataex
clear
input int DHSCLUST double ZonalSt_sh
 1 .34931400418281555
 2 .34931400418281555
 3 .34931400418281555
 4 .34931400418281555
 5  .2954939901828766
 6 .34931400418281555
 7  .3211260139942169
 8  .5356400012969971
 9  .5356400012969971
10  .5356400012969971
end
tempfile 2000m4 // APRIL 2000
save `2000m4'

//  ABOVE ARE THE DATA SETS
//  ACTIVE CODE STARTS HERE

//  COMBINE THE MONTHLY AIR POLLUTION DATA SETS
clear
tempfile all_months
save `all_months', emptyok
foreach f in 2000m3 2000m4 { // EXPAND TO FULL LIST OF MONTHLY FILES
    use ``f'', clear // IF YOUR DATA SETS ARE NOT TEMPFILES, -use `f'-
    gen mdate = tm(`f')
    append using `all_months'
    save `"`all_months'"', replace
}
format mdate %tm
//  CALCULATE THREE AND NINE MONTH LAGGING AVERAGES
rangestat (mean) lag3_Zonal = ZonalSt_sh, by(DHSCLUST) interval(mdate -3 -1)
rangestat (mean) lag9_Zonal = ZonalSt_sh, by(DHSCLUST) interval(mdate -9 -1)
isid DHSCLUST mdate, sort
save `"`all_months'"', replace


//  PREPARE THE BIRTH DATA SET TO MERGE WITH THE POLLUTION DATA
use `birth_data', clear
rename v009 DHSCLUST // THIS IS JUST A GUESS BECAUSE IT'S 1-10
gen mdate = ym(b2, b1)
assert missing(mdate) == missing(b2, b1)
format mdate %tm

//  NOW PUT THE BIRTH DATA SET TOGETHER WITH THE COMBINED POLLUTION DATA
merge m:1 DHSCLUST mdate using `all_months', keep(master match)

Notes:

1. -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.
2. It is likely that your monthly pollution data sets are real data sets, not tempfiles. So in the loop that appends them all together, you will -use `f'-, or if the filenames contain embedded blanks -use `"`f'"'- instead of -use ``f''-.
3. Similarly if the filenames are not 2000m3, 2000m4, etc. then you will have to have some other code that extracts the actual Stata numeric code for that month from the filename. Since I don't know what the filenames are, I can't help you with that.
4. The code does not illustrate the solution well in your example data because your example pollution data is only for March and April of 2000, but none of the example birthdates are in those months.
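
Because the example data cannot show the lagging averages at work, a toy data set may help. The values below are made up purely for illustration (one cluster, six consecutive months with pollution values 1 through 6), and the variable n3 is an added, hypothetical name for the number of months that went into each average. This is only a sketch and requires -rangestat- from SSC.

```stata
* Made-up data: cluster 1, months 2000m1 through 2000m6, values 1 to 6
clear
set obs 6
gen int DHSCLUST = 1
gen mdate = tm(2000m1) + _n - 1
format mdate %tm
gen ZonalSt_sh = _n

* Mean over the three months before each month, plus the number of
* months actually found in that window
rangestat (mean) lag3_Zonal = ZonalSt_sh (count) n3 = ZonalSt_sh, ///
    by(DHSCLUST) interval(mdate -3 -1)

list, noobs
```

In this toy example the first month has no earlier data, so its lag3_Zonal is missing, and the next two months are averaged over fewer than three months. Checking a count such as n3 is one way to spot births whose exposure window is only partly covered by the pollution data.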

Although not essential to solving this particular problem, I highly recommend that you rename all the variables in these data sets to something that has mnemonic value. If you spend a few weeks away from this data and come back to it, it is unlikely you will remember what any of these variables are with names like v131 or b5.


Muhammad Ramzan

Join Date: Jan 2015

Posts: 173
#5

03 Feb 2020, 15:57

Hi, thanks for your quick reply.

Please, what is the use of this command?

tempfile all_months

Regards
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#6

03 Feb 2020, 16:29

This command tells Stata to create a temporary file and to store its name in the local macro all_months. Thereafter, any reference to `all_months' is a reference to that file. The nice thing about temporary files is that they are, well, temporary. They are very useful for holding configurations of data that are needed for intermediate calculations but are not needed for the long run. Temporary files are automatically deleted after the program that creates them ends, so you don't have to go about hunting them down and erasing them to clear out space on your hard drive.

In this instance, the purpose was to create a file that would hold all of the data in the various monthly pollution files. In fact, I would guess that you might want to save that result as a permanent file as it may well have other uses later on in your project. But I was writing demonstration code to run on my machine, and I have no reason to save your example data on my computer for the long haul. So I put it into a temporary file: available for the moment and gone as soon as I'm done with it.
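
A minimal sketch of the save-and-reload pattern described above, using the auto data set that ships with Stata. The macro name cars is only illustrative.

```stata
sysuse auto, clear      // data set shipped with Stata
tempfile cars           // store a temporary file name in local macro cars
save `cars'             // write the data in memory to that file

clear                   // memory is now empty
use `cars', clear       // read the temporary file back
describe, short
```

For a permanent copy, -save- with an ordinary file name of your choosing would be used instead of the temporary one.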
Muhammad Ramzan

Join Date: Jan 2015

Posts: 173
#7

03 Feb 2020, 17:59

Thanks. My air pollution Excel files are named ZS_2000_03, ZS_2000_04, and so on through ZS_2018_07. Should I rename my Excel files to the 2000m3 format?

Clyde Schechter

Join Date: Apr 2014
Posts: 30118
#8

03 Feb 2020, 18:22

You could, but I wouldn't. I'd rather have Stata calculate the file names while it loops over the months from March 2003 through July 2018.

Also, I wouldn't work with the Excel files here. First get all of them imported to Stata. I have the sense that, in fact, you have already done that. If not, do it now. I'll assume that the Stata files you create are named ZS_2003_03.dta through ZS_2018_07.dta. Then you can use this code:

Code:

//  COMBINE THE MONTHLY AIR POLLUTION DATA SETS
clear
local first_month = tm(2003m3)
local last_month = tm(2018m7)


tempfile all_months
save `all_months', emptyok
forvalues m = `first_month'/`last_month' {
    local yy = year(dofm(`m'))
    local mm: display %02.0f =month(dofm(`m'))
    use ZS_`yy'_`mm', clear
    gen mdate = `m'
    append using `all_months'
    save `"`all_months'"', replace
}
format mdate %tm
// CALCULATE THREE AND NINE MONTH LAGGING AVERAGES
rangestat (mean) lag3_Zonal = ZonalSt_sh, by(DHSCLUST) interval(mdate -3 -1)
rangestat (mean) lag9_Zonal = ZonalSt_sh, by(DHSCLUST) interval(mdate -9 -1)
isid DHSCLUST mdate, sort
save `"`all_months'"', replace

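The parts of the code that differ from what was shown in #4 are the definitions of first_month and last_month and the -forvalues- loop, which builds each file name from the year and month and sets mdate directly from the loop counter.
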

Muhammad Ramzan

Join Date: Jan 2015

Posts: 173
#9

03 Feb 2020, 19:52

Thanks

file ZS_2003_03.dta not found

I am getting this error message. How do I specify the path here?

my Stata data files are located at E:\Health and air quality\December\Extract2

Thanks
Muhammad Ramzan

Join Date: Jan 2015

Posts: 173
#10

03 Feb 2020, 20:37

Hi, it worked once I specified the working directory.
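
As an alternative to changing the working directory, the folder can be kept in a local macro and prefixed to each file name. The sketch below only builds and displays the name for March 2003. The macro names datadir and fname are illustrative, and the folder is the one quoted in #9. Forward slashes are used because Stata accepts them on Windows, whereas a backslash placed directly before a macro reference stops the macro from being expanded.

```stata
* Folder quoted in #9, written with forward slashes
local datadir "E:/Health and air quality/December/Extract2"

* Year and zero-padded month for one monthly date, as in the loop in #8
local m = tm(2003m3)
local yy = year(dofm(`m'))
local mm : display %02.0f month(dofm(`m'))

* Full file name that -use- would receive inside the loop
local fname `"`datadir'/ZS_`yy'_`mm'.dta"'
display `"`fname'"'

* Inside the loop this would become:  use `"`fname'"', clear
```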
Muhammad Ramzan

Join Date: Jan 2015

Posts: 173
#11

03 Feb 2020, 21:26

Thanks a lot Clyde Schechter
Muhammad Ramzan

Join Date: Jan 2015

Posts: 173
#12

18 Feb 2020, 08:29

Hi Clyde Schechter,

What is the purpose of this command?

assert missing(mdate) == missing(b2, b1)
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#13

18 Feb 2020, 09:15

The code immediately before that line creates a Stata internal format monthly date variable from your separate month and year variables b1 and b2. The purpose of the assert command is to verify that this was successfully concluded. So, if your data set had an observation with b1 = 13, you have a problem because the month must always be an integer between 1 and 12. When the Stata -ym()- function encounters an invalid month, it returns a missing value. So the -assert- command would notice that in that observation, mdate is missing even though b1 and b2 are not: which can only happen if b1 in that observation does not define a valid month. Since the remainder of the work you will be doing requires a valid monthly date variable, this is a crucial check on the validity of your data.
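
A small sketch of the check in action, using two made-up observations: one valid birth month and one with b1 = 13. Here -capture- is added only so that the demonstration keeps running after the assertion fails. In real code you would normally let -assert- stop the do-file.

```stata
* Made-up data: a valid month (6) and an invalid month (13)
clear
input byte b1 int b2
 6 2005
13 2005
end

gen mdate = ym(b2, b1)
format mdate %tm
list

* Observation 2 has missing mdate although b1 and b2 are both present,
* so the assertion is false there
capture assert missing(mdate) == missing(b2, b1)
display _rc    // nonzero because the assertion failed
```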

Muhammad Ramzan

Join Date: Jan 2015
Posts: 173

#14

19 Feb 2020, 19:16

Hi sir, if I have my pollution data in one file rather than a separate file for each month, how will I have to change the commands?


Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int ClusterPoints double(ZS20003 ZS20004 ZS20005 ZS20006 ZS20007 ZS20008)
1  .2480315 .53149605 .47244096 .37007874 .68503934 .62598425
2  .2519685 .42519686 .31102362 .35039371 .46850392 .53543305
3  .2480315 .53149605 .47244096 .37007874 .68503934 .62598425
4  .2480315 .53149605 .74409449 .64960629 .62204725 .68503934
5 .22834645 .55118108 .81889766 .61417323 .64566928 .66929132
end


Listed 5 out of 972 observations


Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#15

19 Feb 2020, 19:20

From what you show it appears that there are more changes than just having everything in one file. You have only yearly, not monthly data, and the data are in wide layout.

With only yearly data, it isn't possible to calculate 3 month or 9 month lags.

So it is not a matter of changing the data: the problem as originally stated cannot be solved with this data. So think about how you want to change the problem itself and then post back.