Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Trimmed Mean

    Hi Stata Community,

    I hope you are all doing well. I am afraid I have a question regarding the calculation of the mean inside a centile range. I would like to calculate the mean of the variable change1m_ inside the centile range c(24 69) for every year since 2000. My code is the following:

    HTML Code:
    putexcel set Trimmed_PCE.xlsx, sheet(data) replace
    putexcel A1 = "Year"
    putexcel B1 = "Month"
    putexcel C1 = "Trimmed mean"
    local row = 2
    forvalues y = 2000(1)2021 {
    forvalues m = 1(1)12{
    preserve
    keep if year==`y' & month==`m'
    gsort -change1m_
    centile change1m_, c(24 69)
    asgen wtm_`y'`m' = change1m_ if inrange( change1m_, `r(c_1)', `r(c_2)'), w(weight_)
    summarize wtm_`y'`m'
    putexcel A`row' = `y'
    putexcel B`row' = `m'
    putexcel C`row' = `r(mean)'
    local ++row
    restore
    }
    }
    I loop though the years, sot the values in descending order, and then hope to calculate the mean of the values that fall inside the range. I then inout into Excel. However, I think this calculates the weighted mean or something else that is not quite right. I would like the "pure" mean of the values that fall inside c(24 69). How do I do that? Your help would be much appreciated. Thanks!

  • #2
    Maya:
    the following toy-example might be helpful:
    Code:
    . use "https://www.stata-press.com/data/r16/nlswork.dta"
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . centile ln_wage , centile(24 69)
    
                                                           -- Binom. Interp. --
        Variable |       Obs  Percentile    Centile        [95% Conf. Interval]
    -------------+-------------------------------------------------------------
         ln_wage |    28,534         24    1.347073        1.342716    1.353974
                 |                   69    1.872316        1.865443    1.880239
    
    . return list
    
    scalars:
                 r(n_cent) =  2
                      r(N) =  28534
                   r(ub_2) =  1.880238592405991
                   r(lb_2) =  1.865442991256714
                    r(c_2) =  1.872315907478332
                   r(ub_1) =  1.353973627090454
                   r(lb_1) =  1.342715978622437
                    r(c_1) =  1.347073435783386
    
    macros:
               r(centiles) : "24 69"
    
    . gen flag=1 if ln_wage>=r(c_1) & ln_wage<=r(c_2)
    (15,662 missing values generated)
    
    . mean ln_wage if flag==1, over(year)
    
    Mean estimation                     Number of obs   =     12,872
    
    ----------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
    ---------------+------------------------------------------------
    c.ln_wage@year |
               68  |   1.573651   .0051226      1.563609    1.583692
               69  |   1.564508   .0053484      1.554024    1.574992
               70  |   1.576524   .0048386      1.567039    1.586008
               71  |   1.593204   .0047407      1.583912    1.602497
               72  |   1.604837   .0051794      1.594685     1.61499
               73  |   1.603122   .0048026      1.593708    1.612536
               75  |   1.599683   .0047019      1.590467      1.6089
               77  |   1.617797   .0047435      1.608499    1.627095
               78  |   1.621563   .0048568      1.612042    1.631083
               80  |   1.613738   .0052056      1.603534    1.623942
               82  |   1.618086   .0050091      1.608268    1.627905
               83  |   1.623598   .0052505      1.613306     1.63389
               85  |   1.628742   .0053526       1.61825    1.639234
               87  |   1.618079    .005516      1.607267    1.628891
               88  |   1.616127   .0054075      1.605528    1.626727
    ----------------------------------------------------------------
    
    .
    If you have years and months, you can benefit from the -group- function available from -egen-:
    Code:
    egen wanted=group(year month)
    mean <whatyouwant> if flag==1, over(wanted)
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Dear Carlo,

      thank you very much, that is incredibly useful! Really appreciate you taking time. Please allow me one last question as to the implementation in the final loop. I am calculating the centiles over a weight column that already exists. As a means of checking, I insert the cum column so I can visually see where the cutoff is. However, I am struggling to loop over the mean of the variable change1m_, which is the one of interest. I know you suggested using an -egen function but since I am looping over 12 months times 21 years, I would need to loop over that and I am unsure how to do that. Would you be able to help me with that?

      HTML Code:
      putexcel set Trimmed_PCE.xlsx, sheet(data) replace
      putexcel A1 = "Year"
      putexcel B1 = "Month"
      putexcel C1 = "Trimmed mean"
      local row = 2
      forvalues y = 2021(1)2021 {
         forvalues m = 1(1)12{
         preserve
         keep if year==`y' & month==`m'
         gsort -change1m_
       
         gen cum`y'`m' = sum(weight)
         generate flag`y'`m' = 1
         replace flag`y'`m' = 0 if cum`y'`m'>69 | cum`y'`m' < 24
         
         mean change1m_ if flag ==1
         
         summarize mean_`y'`m'
         putexcel A`row' = `y'
         putexcel B`row' = `m'
         putexcel C`row' = `r(mean)'
         local ++row
          restore
          }
      }

      Thank you very much. Best wishes, Maya

      Comment


      • #4
        Maya:
        I find difficult to reply positively without an excerpt/example of your dataset, that you can share via -dataex-.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Dear Carlo,

          thank you for your patience. The "weight_" variable is given and in this case determines where the 24/69 cutoff is; it is the weight associated with the category (i.e. the CPI item). I would like to calculate the mean of all the change1m_ that fall within that range. I would like to loop over all years and then input into excel. Does that make sense? Apologies if it still is unclear.

          id category change1m_ weight_ year month week modate cum flag
          261 USPXHHLIIndex 152.4045 0.327972 2021 9 39 2021m9 0.327972 0
          261 USPXBEEFIndex 74.85259 0.381732 2021 9 39 2021m9 0.709703 0
          261 USPXPRRCIndex 4.164208 0.014825 2021 9 39 2021m9 64.33695 1
          261 USPXMCAHIndex 4.105961 1.421673 2021 9 39 2021m9 65.75863 1
          261 USPXPASHIndex 4.001945 -0.19829 2021 9 39 2021m9 65.56034 1
          261 USPXOVFHIndex -56.8583 0.027497 2021 9 39 2021m9 99.42583 0
          261 USPXAITAIndex -71.5311 0.574114 2021 9 39 2021m9 99.99995 0
          Thank you very much for your support and providing a public good. Best wishes Maya.

          Comment


          • #6
            Maya:
            as far as I can see (without a -dataex- excerpt of your data any test from my side gets too demanding) your code seems to do what you want.
            I can't comment on inputting results to Excel, as I've never challenged myself with that.
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7

              Hi Carlo, please allow me to ask then how to loop over
              HTML Code:
                 mean change1m_ if flag ==1
              I currently don't manage to get the loop over the mean of change1m_ right. For every loop over year and month, I'd like to get something like

              HTML Code:
              generate mean`y'`m' = mean change1m_ if flag ==1
              so that I get a mean for change1m_2021sep, change1m_2021aug, change1m_2021july .. Obviously, generate x = mean (of variable of interest) does not work but this is what I need and I not sure how to get there. Does that make sense? Thanks so much for helping me, I think I'm almost there!

              Comment


              • #8
                Maya:
                in the following (totally fool) toy-example, I've created a double loop that might be helpful for the chink of code you're complaining about:
                Code:
                set obs 10
                set obs 12
                g year=_n+2000
                g month=_n
                g mean=runiform()
                g flag=1 in 1/3
                replace flag=1 in 6/8
                replace flag=0 if flag==.
                forvalues i = 2001(1)2012 {
                forvalues j = 1(1)12 {
                g mean_`i'_`j'=mean if flag==1
                 }
                 }
                Kind regards,
                Carlo
                (Stata 19.0)

                Comment


                • #9
                  Dear Carlo,
                  thank you so very much! This is incredibly useful. Very final question if you permit - is the
                  HTML Code:
                    mean_`i'_`j'
                  stored in a local macro? That would be useful because it would allow me to use put excel with it. However, if I use
                  HTML Code:
                    summarize mean_`i'_`j'
                  and then access

                  HTML Code:
                    `r(mean)'
                  this is obviously a different meant to the one I had calculated in the loop. Is there a way around that?

                  Comment


                  • #10
                    Maya:
                    I've never exported Stata results to spreadsheets.
                    Sorry I cannot be more helpful.
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    Comment

                    Working...
                    X