Averages across time for a variable

Rachel Ennis

Join Date: Feb 2018

Posts: 5
#1

Averages across time for a variable

13 Feb 2018, 07:59

Hi all,

I am wishing to calculate the average of the 80th percentile (variable name p80) of household income distribution in a given geographical unit (variable name geo) in the year t (variable name year with years 2001-2014) and the prior two years (t-1 and t-2).

Currently I have p80 values for all geographical units for each year. There are duplicates in the p80 variable observations as this income distribution percentile cut off is assigned to individuals in the same geographical unit reported in the same year.

After many attempts, I am struggling to find a way to formulate this in stata.

Any help would be appreciated, thanks in advance.

Rachel
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

13 Feb 2018, 08:15

In the future, whenever you want advice on code, you should post an example of your data. I will here show you code that will work if your data is as I imagine it to be. If your data are different (and they very well may be) then we will have both wasted our time.

Code:

rangestat (mean) p80, by(country) interval(year -2 0)

-rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. -ssc install rangestat- to get it, if you don't already have it.

This assumes that your data are long layout. That is, you have one observation per country for each time. It also assumes that p80 and time are numeric variables, and that the time variable is in units such that t-2 and t-1 and t really mean a difference of 1 (as opposed to daily dates and a unit of years whereby the difference would be 365 instead of 1).

To show example data, use the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, read -help dataex- for the simple instructions for using it.

Remember: if you want help with code, you need to show example data, and to show example data, always use -dataex-.

Last edited by Clyde Schechter; 13 Feb 2018, 08:17.
Comment
Rachel Ennis

Join Date: Feb 2018

Posts: 5
#3

13 Feb 2018, 08:56

Thank you for your quick response and for advising me on how to obtain a data example, Clyde.

Unfortunately I did not describe my data well as there are several observations for each geographical unit and time.

My data looks as follows.

p80 geo year
714.91 1 2001
714.91 1 2001
714.91 1 2001
714.91 1 2001
714.91 1 2001
714.91 1 2001
714.91 1 2001
714.91 1 2001

Here, geo contains values ranging from 1 to 14, which represents 14 different geographical areas. There are 14 time periods for each geographical unit. As you can see, there are many repeats in p80 observations as the data contains several households that are assigned to the same geographical area and recorded in the same year, thus having the same 80th percentile income distribution cutoff.

Thus your suggested code will not work with this data.

I appreciate the time you have already dedicated to answer my question but any other advice would be appreciated.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

13 Feb 2018, 09:03

Actually, this is a somewhat different layout from any of the possibilities I had imagined. But the code in #2 actually might work with this data. It depends on how you want to handle the multiplicity.

I see that you have multiple observations for the same geo and year combination. All 8 of those observations, in the example you show, have the same value of p80. Why do you have 8 observations with the same information? Are there other variables in your data on which these observations differ? Do you want to include all 8 of those observations in the average? Or do you want to pick just one for each country-year combination? Are there situations in your data where there are multiple observations for a given country-year combination but the values of p80 are not all the same? If so, how do you want to handle that situation?
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35699
#5

13 Feb 2018, 09:08

I underline Clyde's (excellent as always) general advice.

As I've said somewhere else recently, the mean of a constant is just that constant. This seems obvious and lacking in interest when met as a statistical principle, but it's often useful in programming. This may have some bearing on the problem here.
Comment
Rachel Ennis

Join Date: Feb 2018

Posts: 5
#6

13 Feb 2018, 09:25

As I am looking at household income with reference to income distribution in their area, each household living in the same area is assigned the same value for the p80 variable. I have generated the income distribution percentiles by area and assigned them to households in each area. These income distribution percentiles change yearly for each area. There are other variables within the data that differ for households, most importantly hh income but also other hh characteristics and hh expenditure data.

I do not wish to include all 8 observations in the average, I am merely looking to calculate an average for each area. It doesn't not matter whether I calculate the average of 80th percentile income for each geographical unit across t, t-1, t-2, and then assign this value to all households within each area. Or somehow calculate this for each household as the values for households within the same area will be the same.

Since I have dropped households with zero income and zero data regarding their area, there should be no missing values. So for every area-year combinations the values of p80 are the same.

Last edited by Rachel Ennis; 13 Feb 2018, 09:40.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#7

13 Feb 2018, 09:34

That's clearer, but it still leaves a question in my mind. So, for example, let's say there are 100 households for a given geo-year combination, and p80 = 700 for all of them, and let's say for that same geo in the preceding year there are just 80 households, and p80 = 600 for all of them. And let's say that for that same geo in the second preceding year there are 90 households, and p80 = 650. Do you want an unweighted average of p80 over the years, i.e. (600 + 650 + 700)/3 = 650, or do you want the average weighted by number of households, i.e. (100*700 + 80*600 + 90*650)/(100+80+90) = 653.7037? Or perhaps there are always the same number of households for each year in any area, in which case it makes no difference?
Comment
Rachel Ennis

Join Date: Feb 2018

Posts: 5
#8

14 Feb 2018, 09:38

Thank you for flagging up this issue regarding weighting.

The code I used for generating the income percentiles is as follows:

/* Creating hhinc percentiles by geographical unit */
sort year geo

by year geo : egen p10 = pctile(hhincome), p(10)
by year geo : egen p20 = pctile(hhincome), p(20)
by year geo : egen p30 = pctile(hhincome), p(30)
by year geo : egen p40 = pctile(hhincome), p(40)
by year geo : egen p50 = pctile(hhincome), p(50)
by year geo : egen p60 = pctile(hhincome), p(60)
by year geo : egen p70 = pctile(hhincome), p(70)
by year geo : egen p80 = pctile(hhincome), p(80)
by year geo : egen p90 = pctile(hhincome), p(90)

However, these have not been weighted which is my mistake. Within the data there are given annual and quarterly weights for each household which adjust for non-response and to gross to population estimates (see below).

weighta weightq
3.292575 13.1703
3.26015 13.0406
2.8177 11.2708
2.7522 11.0088
3.997525 15.9901
4.017525 16.0701
3.918375 15.6735
2.83755 11.3502
2.089425 8.3577
4.100125 16.4005
2.789925 11.1597

Having looked into it, I am aware that the egen function does not allow weights to be used. Would you be able to advise on this?

From here, I suspect I would be wanting an unweighted average of p80 across the past three years within each geographical unit.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#9

14 Feb 2018, 10:06

So, since I don't have an example of the data with all the variables needed, this is untested. You may have to fix it:

Code:

capture program drop one_quantile program define one_quantile foreach q of numlist 10(10)90 { qreg hhincome [pweight = weighta], q(`q') gen p`q' = _b[_cons] } exit end runby one_quantile, by(year geo) status

-runby- is by Robert Picard and me, and is available from SSC.

The program one_quantile uses -qreg- to do a weighted calculation of the quantile, looping over quantiles from 10 through 90. -runby- iterates the program by year and geo combination. I assumed those weights are pweights; they kind of look like it. But -qreg- also supports fweights (which these clearly are not) and iweights (which they could be). So do consult the survey documentation to verify what kind of weights these are.

That should get you the quantiles.

Once you have the quantiles, to get an unweighted running average, I think it would then be like this:

Code:

by year geo, sort: gen flagged_p80 = p80 if _n == 1 rangestat (mean) mean_p80 = flagged_p80, by(country) interval(year -2 0)

The first line simply picks out a single observation of p80 (which is the same for all observations within geo year). Then -rangestat- will compute the running window mean you asked for in #1.
Comment
Rachel Ennis

Join Date: Feb 2018

Posts: 5
#10

19 Feb 2018, 09:04

Thanks Clyde. Your suggested code ran smoothly and I have managed to obtain the quantiles.

I am also having issues with creating 'income buckets'. I wish to assign each household a value that reflects which household income bucket they fall into. For example, buckets for every £50 of current weekly income, so a household with income £120 would be assigned income bucket '3'.

I assume this could done using similar code, however, since I am unfamiliar with it I am not sure how to alter the code to suit this task.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35699
#11

19 Feb 2018, 09:17

50 * floor(income/50) rounds down to produce ..., -50, 0, 50, 100, ....

I recommend that system strongly over cryptic buckets ..., 1, 2, 3, ... It is no more complicated and much clearer than such labelling by successive integers.

Conversely 50 * ceil(income/50) rounds up.
1 like
Comment
Latoya Sundack

Join Date: Jul 2019

Posts: 67
#12

29 Jan 2020, 12:47

Dear Statalister,

I have the monthly long term mean by regions.

I would like to create a rainfall shock variable equals 1 if annual rainfall is greater (>) than the region's 80 percentile of the region's long-term mean, 0 if between the 20th and 80th percentile, and -1 if below the 20th percentile.

I have tried this code to try to see if I can generate the to 80, but I don't think it's the correct approach.

Code:

bysort reg year: egen median=pctile(meanrain), p(80) gen top 80

Can you please advise?

Kind Regards!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#13

29 Jan 2020, 14:01

This is very confusing. What does "80 percentile of the region's long-term mean" mean? The long term mean sounds like a single number. You can only calculate a percentile of a distribution of numbers? What variable do you have that you want the 80th percentile of its distribution? And to what variable do you want that compared? Please show example data, using the -dataex- command, and explain more clearly what is wanted.

If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Added: Also, if you have monthly observations on regions and you are doing this separately for each year (which the code you show seems to imply), then barring the possibility of tied values, the 80th percentile and 20th percentile of a sample of 12 months will basically partition the data into the 2 highest, the 2 lowest, and everything in between. Is that what you intend?

Last edited by Clyde Schechter; 29 Jan 2020, 14:04.
Comment
Latoya Sundack

Join Date: Jul 2019

Posts: 67
#14

03 Feb 2020, 05:10

Dear Clyde,

Sorry for not being very clear.

I have review the paper that I would like to follow to generate my rainfall shock variable and I found that "rainfall shock equals one if annual rainfall is greater than the region's eightieth percentile of rainfall, zero if between the twentieth and eightieth percentiles, and minus one if below the twentieth percentile. I have daily rainfall data by stations within each regions. Previously, I have created the shock variable using the monthly long term mean. Can you please tell me how this can be done? An example of the daily data is shown below:

Code:

station year month d e f g h i j k l m n o p q r s t u v w x y z aa ab ac ad ae af ag ah Leo 2005 1 65 0 28.4 17.4 0 40.3 61.4 8.7 1.2 0 0 0 4.1 9.5 12.8 69 81 67 72 6.7 74 96.2 1.7 .5 4.4 18.3 16.7 33 3 16.5 91.2

I have used the code below to generate the region given their station names.

Code:

gen reg=. replace reg=1 if stationaname=="His"

I am grateful for your help.

Regards!
Comment
Latoya Sundack

Join Date: Jul 2019

Posts: 67
#15

03 Feb 2020, 08:58

Originally posted by Latoya Sundack View Post

Dear Clyde,

Sorry for not being very clear.

I have reviewed the paper that I would like to follow to generate my rainfall shock variable and I found that "rainfall shock equals one if annual rainfall is greater than the region's eightieth percentile of rainfall, zero if between the twentieth and eightieth percentiles, and minus one if below the twentieth percentile. I have daily rainfall data by stations within each regions. Previously, I have created the shock variable using the monthly long term mean. Can you please tell me how this can be done? An example of the daily data is shown below:

Code:

station year month d e f g h i j k l m n o p q r s t u v w x y z aa ab ac ad ae af ag ah Leo 2005 1 65 0 28.4 17.4 0 40.3 61.4 8.7 1.2 0 0 0 4.1 9.5 12.8 69 81 67 72 6.7 74 96.2 1.7 .5 4.4 18.3 16.7 33 3 16.5 91.2

I have used the code below to generate the region given their station names.

Code:

gen reg=. replace reg=1 if stationaname=="His"

I am grateful for your help.

Regards!

Thank you!
Comment

Announcement

Averages across time for a variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment