No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Averages across time for a variable

    Hi all,

    I am wishing to calculate the average of the 80th percentile (variable name p80) of household income distribution in a given geographical unit (variable name geo) in the year t (variable name year with years 2001-2014) and the prior two years (t-1 and t-2).

    Currently I have p80 values for all geographical units for each year. There are duplicates in the p80 variable observations as this income distribution percentile cut off is assigned to individuals in the same geographical unit reported in the same year.

    After many attempts, I am struggling to find a way to formulate this in stata.

    Any help would be appreciated, thanks in advance.


  • #2
    In the future, whenever you want advice on code, you should post an example of your data. I will here show you code that will work if your data is as I imagine it to be. If your data are different (and they very well may be) then we will have both wasted our time.

    rangestat (mean) p80, by(country) interval(year -2 0)
    -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. -ssc install rangestat- to get it, if you don't already have it.

    This assumes that your data are long layout. That is, you have one observation per country for each time. It also assumes that p80 and time are numeric variables, and that the time variable is in units such that t-2 and t-1 and t really mean a difference of 1 (as opposed to daily dates and a unit of years whereby the difference would be 365 instead of 1).

    To show example data, use the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, read -help dataex- for the simple instructions for using it.

    Remember: if you want help with code, you need to show example data, and to show example data, always use -dataex-.
    Last edited by Clyde Schechter; 13 Feb 2018, 09:17.


    • #3
      Thank you for your quick response and for advising me on how to obtain a data example, Clyde.

      Unfortunately I did not describe my data well as there are several observations for each geographical unit and time.

      My data looks as follows.

      p80 geo year
      714.91 1 2001
      714.91 1 2001
      714.91 1 2001
      714.91 1 2001
      714.91 1 2001
      714.91 1 2001
      714.91 1 2001
      714.91 1 2001

      Here, geo contains values ranging from 1 to 14, which represents 14 different geographical areas. There are 14 time periods for each geographical unit. As you can see, there are many repeats in p80 observations as the data contains several households that are assigned to the same geographical area and recorded in the same year, thus having the same 80th percentile income distribution cutoff.

      Thus your suggested code will not work with this data.

      I appreciate the time you have already dedicated to answer my question but any other advice would be appreciated.


      • #4
        Actually, this is a somewhat different layout from any of the possibilities I had imagined. But the code in #2 actually might work with this data. It depends on how you want to handle the multiplicity.

        I see that you have multiple observations for the same geo and year combination. All 8 of those observations, in the example you show, have the same value of p80. Why do you have 8 observations with the same information? Are there other variables in your data on which these observations differ? Do you want to include all 8 of those observations in the average? Or do you want to pick just one for each country-year combination? Are there situations in your data where there are multiple observations for a given country-year combination but the values of p80 are not all the same? If so, how do you want to handle that situation?


        • #5
          I underline Clyde's (excellent as always) general advice.

          As I've said somewhere else recently, the mean of a constant is just that constant. This seems obvious and lacking in interest when met as a statistical principle, but it's often useful in programming. This may have some bearing on the problem here.


          • #6
            As I am looking at household income with reference to income distribution in their area, each household living in the same area is assigned the same value for the p80 variable. I have generated the income distribution percentiles by area and assigned them to households in each area. These income distribution percentiles change yearly for each area. There are other variables within the data that differ for households, most importantly hh income but also other hh characteristics and hh expenditure data.

            I do not wish to include all 8 observations in the average, I am merely looking to calculate an average for each area. It doesn't not matter whether I calculate the average of 80th percentile income for each geographical unit across t, t-1, t-2, and then assign this value to all households within each area. Or somehow calculate this for each household as the values for households within the same area will be the same.

            Since I have dropped households with zero income and zero data regarding their area, there should be no missing values. So for every area-year combinations the values of p80 are the same.

            Last edited by Rachel Ennis; 13 Feb 2018, 10:40.


            • #7
              That's clearer, but it still leaves a question in my mind. So, for example, let's say there are 100 households for a given geo-year combination, and p80 = 700 for all of them, and let's say for that same geo in the preceding year there are just 80 households, and p80 = 600 for all of them. And let's say that for that same geo in the second preceding year there are 90 households, and p80 = 650. Do you want an unweighted average of p80 over the years, i.e. (600 + 650 + 700)/3 = 650, or do you want the average weighted by number of households, i.e. (100*700 + 80*600 + 90*650)/(100+80+90) = 653.7037? Or perhaps there are always the same number of households for each year in any area, in which case it makes no difference?


              • #8
                Thank you for flagging up this issue regarding weighting.

                The code I used for generating the income percentiles is as follows:

                /* Creating hhinc percentiles by geographical unit */
                sort year geo

                by year geo : egen p10 = pctile(hhincome), p(10)
                by year geo : egen p20 = pctile(hhincome), p(20)
                by year geo : egen p30 = pctile(hhincome), p(30)
                by year geo : egen p40 = pctile(hhincome), p(40)
                by year geo : egen p50 = pctile(hhincome), p(50)
                by year geo : egen p60 = pctile(hhincome), p(60)
                by year geo : egen p70 = pctile(hhincome), p(70)
                by year geo : egen p80 = pctile(hhincome), p(80)
                by year geo : egen p90 = pctile(hhincome), p(90)

                However, these have not been weighted which is my mistake. Within the data there are given annual and quarterly weights for each household which adjust for non-response and to gross to population estimates (see below).

                weighta weightq
                3.292575 13.1703
                3.26015 13.0406
                2.8177 11.2708
                2.7522 11.0088
                3.997525 15.9901
                4.017525 16.0701
                3.918375 15.6735
                2.83755 11.3502
                2.089425 8.3577
                4.100125 16.4005
                2.789925 11.1597

                Having looked into it, I am aware that the egen function does not allow weights to be used. Would you be able to advise on this?

                From here, I suspect I would be wanting an unweighted average of p80 across the past three years within each geographical unit.


                • #9
                  So, since I don't have an example of the data with all the variables needed, this is untested. You may have to fix it:

                  capture program drop one_quantile
                  program define one_quantile
                      foreach q of numlist 10(10)90 {
                          qreg hhincome [pweight = weighta], q(`q')
                          gen p`q' = _b[_cons]
                  runby one_quantile, by(year geo) status
                  -runby- is by Robert Picard and me, and is available from SSC.

                  The program one_quantile uses -qreg- to do a weighted calculation of the quantile, looping over quantiles from 10 through 90. -runby- iterates the program by year and geo combination. I assumed those weights are pweights; they kind of look like it. But -qreg- also supports fweights (which these clearly are not) and iweights (which they could be). So do consult the survey documentation to verify what kind of weights these are.

                  That should get you the quantiles.

                  Once you have the quantiles, to get an unweighted running average, I think it would then be like this:

                  by year geo, sort: gen flagged_p80 = p80 if _n == 1
                  rangestat (mean) mean_p80 = flagged_p80, by(country) interval(year -2 0)
                  The first line simply picks out a single observation of p80 (which is the same for all observations within geo year). Then -rangestat- will compute the running window mean you asked for in #1.


                  • #10
                    Thanks Clyde. Your suggested code ran smoothly and I have managed to obtain the quantiles.

                    I am also having issues with creating 'income buckets'. I wish to assign each household a value that reflects which household income bucket they fall into. For example, buckets for every £50 of current weekly income, so a household with income £120 would be assigned income bucket '3'.

                    I assume this could done using similar code, however, since I am unfamiliar with it I am not sure how to alter the code to suit this task.


                    • #11
                      50 * floor(income/50) rounds down to produce ..., -50, 0, 50, 100, ....

                      I recommend that system strongly over cryptic buckets ..., 1, 2, 3, ... It is no more complicated and much clearer than such labelling by successive integers.

                      50 * ceil(income/50) rounds up.