Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Returning the sum of certain observations in a by group

    I would like to sum the values of one variable, where values of another variable satisfy some condition. Here is some example data (based on this Stata help page)

    Code:
    clear
    
    input famid person female age income
    1 3 1 14 250
    1 2 1 16 300
    1 1 1 36 42325
    2 5 0 10 100
    2 4 1 12 150
    2 3 0 14 175
    2 2 1 42 48942
    2 1 0 45 47621
    3 6 1 3 25
    3 5 1 7 50
    3 4 1 9 75
    3 3 0 11 110
    3 2 1 36 41984
    3 1 0 39 42329
    end

    Within each family ID, for each person I would like to see the sum of income of other family members, but only of those family members that are up to 4 years older than each person.

    So for example, family 1 has a 14 yr old, and there is 1 other family member within the age range of being 4 years older, and the sum of that 1 person's income is 300, so I'd like the value 300 placed with the 14 yr old. The other two members of family 1 do not have anyone within a +4 year age range, so no value should be generated for the two other members.

    Family 2 has a 10 yr old, and there are 2 family members who are within +4 years age range, and the sum of income for those two members is 325, so the value 325 should be put with the 10 yr old record.
    Family 2 has a 12 yr old, with 1 other family member who is within +4 years age range, and the sum of income for that one person is 175, so the value 175 should be put with the 12 yr old record.

    So on and so forth.

    The attempts I've made so far are

    Code:
    bysort famid: egen newvar = total(income) if inrange(age, age, age+4)
    and

    Code:
    bysort famid: egen newvar = total(income) if inrange(age[_n], age[_n], age[_n]+4)
    but these do not work.



    Thanks for reading.

  • #2
    Just realized my if conditions

    Code:
     if inrange(age[_n], age[_n], age[_n]+4)
    are trivially true at all times.

    Another attempt I made is

    Code:
    bysort famid: egen newvar = total(income) if inrange(age[_n+1], age[_n], age[_n]+4)
    which feels closer but also does not work

    I also thought of a loop from 1 to _N in each group but it appears that multiline statements in a bysort command are not allowed

    Code:
    bysort famid: {
    gen newvar1 = (_n==_N)
    gen newvar2 = cond(_n==2, 1,45)
    }
    Last edited by Sean Fiedler; 11 May 2016, 11:00.

    Comment


    • #3
      As you've probably realised, your code attempts didn't work because nothing extends the inrange() scope beyond the current observation, even within a by: framework. Otherwise put, it's always true that age is within the interval age, age + 4 where age is interpreted as the value in the current observation.

      rangestat (SSC) is customised for problems such as these (and more widely).

      Code:
      clear
      
      input famid person female age income
      1 3 1 14 250
      1 2 1 16 300
      1 1 1 36 42325
      2 5 0 10 100
      2 4 1 12 150
      2 3 0 14 175
      2 2 1 42 48942
      2 1 0 45 47621
      3 6 1 3 25
      3 5 1 7 50
      3 4 1 9 75
      3 3 0 11 110
      3 2 1 36 41984
      3 1 0 39 42329
      end
      
      rangestat mean=income, interval(age 0 4) by(famid) excludeself
       
      list, sepby(famid)
      
      
           +------------------------------------------------+
           | famid   person   female   age   income    mean |
           |------------------------------------------------|
        1. |     1        3        1    14      250     300 |
        2. |     1        2        1    16      300       . |
        3. |     1        1        1    36    42325       . |
           |------------------------------------------------|
        4. |     2        5        0    10      100   162.5 |
        5. |     2        4        1    12      150     175 |
        6. |     2        3        0    14      175       . |
        7. |     2        2        1    42    48942   47621 |
        8. |     2        1        0    45    47621       . |
           |------------------------------------------------|
        9. |     3        6        1     3       25      50 |
       10. |     3        5        1     7       50    92.5 |
       11. |     3        4        1     9       75     110 |
       12. |     3        3        0    11      110       . |
       13. |     3        2        1    36    41984   42329 |
       14. |     3        1        0    39    42329       . |
           +------------------------------------------------+
      For rangestat, see the original announcement http://www.statalist.org/forums/foru...s-within-range

      or search the forum for mentions.

      EDIT: Note further that the help for egen explicitly warns against mixing egen calls with subscripting. The reason is that egen often entails temporary changes in the sort order, so subscripts may change.

      Last edited by Nick Cox; 11 May 2016, 11:06.

      Comment


      • #4
        Thank you Nick! This is perfect and does exactly what I was looking for.

        I made a small modification which I'll paste in case it helps someone else in the future


        Code:
        rangestat (sum) Newvar=income, interval(age 0 4) by(famid) excludeself

        Comment


        • #5
          Yes; you did say sum clearly. The default is the mean.

          Comment

          Working...
          X