  • how to calculate the cumulative mean by groups?

    For example, for the observation at group 1, time 1, the cumulative mean is missing; for group 1, time 2, it is the mean of the previous observations, namely 74; for group 1, time 3, it is the mean of the previous observations, namely the average of 74 and 85.5.

    Thanks a ton in advance!

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float group byte time double x
    1  1   74
    1  2 85.5
    1  3 83.3
    1  4 83.7
    1  5   53
    1  6   81
    1  7   72
    1  8 89.9
    1  9 85.3
    1 10 87.5
    1 12 82.8
    1 13 79.2
    1 15 80.8
    1 16 85.2
    2  1   62
    2  2   73
    2  3   63
    2  4   63
    2  5   78
    2  6   68
    end

  • #2
    What you want is a mean defined over a range of observations. See rangestat from SSC. I am assuming nonconsecutive time periods imply missing values.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float group byte time double x
    1  1   74
    1  2 85.5
    1  3 83.3
    1  4 83.7
    1  5   53
    1  6   81
    1  7   72
    1  8 89.9
    1  9 85.3
    1 10 87.5
    1 12 82.8
    1 13 79.2
    1 15 80.8
    1 16 85.2
    2  1   62
    2  2   73
    2  3   63
    2  4   63
    2  5   78
    2  6   68
    end
    
    qui sum time
    rangestat (mean) x, interval(time `=-`r(max)'' -1) by(group)
    Results:

    Code:
    . l, sepby(gr)
    
         +---------------------------------+
         | group   time      x      x_mean |
         |---------------------------------|
      1. |     1      1     74           . |
      2. |     1      2   85.5          74 |
      3. |     1      3   83.3       79.75 |
      4. |     1      4   83.7   80.933333 |
      5. |     1      5     53      81.625 |
      6. |     1      6     81        75.9 |
      7. |     1      7     72       76.75 |
      8. |     1      8   89.9   76.071429 |
      9. |     1      9   85.3        77.8 |
     10. |     1     10   87.5   78.633333 |
     11. |     1     12   82.8       79.52 |
     12. |     1     13   79.2   79.818182 |
     13. |     1     15   80.8   79.766667 |
     14. |     1     16   85.2   79.846154 |
         |---------------------------------|
     15. |     2      1     62           . |
     16. |     2      2     73          62 |
     17. |     2      3     63        67.5 |
     18. |     2      4     63          66 |
     19. |     2      5     78       65.25 |
     20. |     2      6     68        67.8 |
         +---------------------------------+
    
    .
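For comparison, here is a minimal sketch of the same "mean of previous observations" using only built-in Stata, with a running sum instead of rangestat. It uses a subset of the example data and assumes that x has no missing values and that each group has at most one row per time; the variable name prev_mean is only illustrative.

```stata
* Subset of the example data above
clear
input float group byte time double x
1 1 74
1 2 85.5
1 3 83.3
2 1 62
2 2 73
end

* Within each group, ordered by time:
*   sum(x) is the running total including the current row,
*   so sum(x) - x is the total over the earlier rows,
*   and _n - 1 is the number of earlier rows.
* In the first row of each group this is 0/0, which Stata returns as missing.
bysort group (time): generate double prev_mean = (sum(x) - x) / (_n - 1)

list, sepby(group)
```

If x could be missing, the count of earlier rows would have to be adjusted, which is one reason rangestat may be the more convenient route.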



    • #3
      Originally posted by Andrew Musau
      What you want is a mean defined over a range of observations. See rangestat from SSC. I am assuming nonconsecutive time periods imply missing values.

      Thank you! Could you please explain "interval(time `=-`r(max)'' -1)" in more detail? I understand that it sets the range of observations, but what do "-`r(max)'" and "-1" do?



      • #4
        From

        Code:
        help rangestat
        interval(keyvar low high) is required and defines the interval that selects the set of observations to use to calculate result for the current observation. keyvar
        is a numeric variable. Observations whose values for keyvar fall within the closed interval bounds are selected. low and high can each be specified using a
        numeric variable, a # (a number in Stata parlance), or a system missing value. If a # is used, the bound for each observation is computed by adding # to
        keyvar. If low is specified using a system missing value, low is set to missing for all observations. rangestat applies the same rules as inrange() for
        missing bounds: if the lower bound is missing, observations will match up to and including the value of high. If both low and high are missing, all
        observations will match. Note that the treatment of missing values for low and high differs in version 1.1 up from the previous version of rangestat and this
        may require that previous code be adapted. (Use which to find out which version you are running if you do not know.)
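        -r(max)- after summarize holds the maximum value of the summarized variable, here time. So the key variable is time, the low bound is time minus the maximum of time, and the high bound is time - 1. Because numeric bounds are added to the key variable, the interval covers every earlier period in the group, ending at the previous observation (time - 1) if the data are sorted by time and there are no holes in the panel.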
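To make the bounds concrete, here is a sketch that writes them out as variables, which the help text above says is allowed for low and high. It assumes rangestat is installed from SSC and uses a subset of the example data; the variable names low and high are only illustrative.

```stata
* Requires rangestat from SSC: ssc install rangestat
clear
input float group byte time double x
1 1 74
1 2 85.5
1 3 83.3
2 1 62
2 2 73
end

* Spell out the bounds that the numeric offsets imply
quietly summarize time
generate low  = time - r(max)   // far enough back to reach the first period
generate high = time - 1        // stop just before the current period

* Intended to select the same rows as interval(time `=-`r(max)'' -1)
rangestat (mean) x, interval(time low high) by(group)

list, sepby(group)
```

In the first period of each group no observation falls inside the interval, so x_mean is missing there.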



        • #5
          Originally posted by Andrew Musau
          From

          Code:
          help rangestat


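          -r(max)- after summarize holds the maximum value of the summarized variable, here time. So the key variable is time, the low bound is time minus the maximum of time, and the high bound is time - 1. Because numeric bounds are added to the key variable, the interval covers every earlier period in the group, ending at the previous observation (time - 1) if the data are sorted by time and there are no holes in the panel.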
          Thanks again! I finally understand: the interval defines where the selected observations lie relative to the current observation's value of the key variable, right? For example, does interval(time . .) select all observations, interval(time . -1) select the observations from the first one up to the previous one, and interval(time 0 0) select only the current observation? Is that right?



          • #6
            That’s correct. In this context system missing means as large as possible, whether it’s a subtraction (e.g. looking back in time) or an addition (e.g. looking forward).
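The intervals discussed here can be compared side by side in a short sketch. It assumes rangestat is installed from SSC and that the newvar=varname form is available for naming results; it uses a subset of the example data, and the mean_* variable names are only illustrative.

```stata
* Requires rangestat from SSC: ssc install rangestat
clear
input float group byte time double x
1 1 74
1 2 85.5
1 3 83.3
2 1 62
2 2 73
end

* All observations in the group (both bounds missing)
rangestat (mean) mean_all=x, interval(time . .) by(group)

* Everything up to and including time - 1 (looking back)
rangestat (mean) mean_prev=x, interval(time . -1) by(group)

* Only observations at the current time
rangestat (mean) mean_cur=x, interval(time 0 0) by(group)

* Everything from time + 1 onward (looking forward)
rangestat (mean) mean_next=x, interval(time 1 .) by(group)

list, sepby(group)
```

With one row per group and time, mean_cur simply reproduces x, mean_prev is missing in the first period of each group, and mean_next is missing in the last.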



            • #7
              Originally posted by Nick Cox
              That’s correct. In this context system missing means as large as possible, whether it’s a subtraction (e.g. looking back in time) or an addition (e.g. looking forward).
              Thanks Nick!
