Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to count number of observations per id

    Hi everyone,

    I have a large data with around 20,000 observations where V1 is the household ID that gives out one IDs per household. Another variable that I want to use is V2 which is labelled either 1,2,or 3. I want to figure out how many V2=3 there are for each household ID and make that as a new variable.

    For example, if this is what the data looks like:
    V1 V2 NEW VAR
    10 1 1
    10 2 1
    10 3 1
    11 2 0
    11 2 0
    If V1=10, I want to count how many responses from V2 are 3 and make that as a new variable.

    I have read some threads on this matter and tried egen, count method that was recommended but did not succeed.
    I feel like this is a simple problem and I am sorry for the repetition in question.

    Thank you in advance!

    P.S. I use STATA 14.0 for Mac
    Last edited by Graceh Kim; 23 Sep 2020, 20:05.

  • #2
    Maybe something along these lines.
    Code:
    bysort V1: egen long tot3 = sum(V2 == 3)
    .ÿ
    .ÿversionÿ16.1

    .ÿ
    .ÿclearÿ*

    .ÿ
    .ÿinputÿbyte(V1ÿV2ÿNEW_VAR)

    ÿÿÿÿÿÿÿÿÿÿÿV1ÿÿÿÿÿÿÿÿV2ÿÿÿNEW_VAR
    ÿÿ1.ÿ10ÿ1ÿ1
    ÿÿ2.ÿ10ÿ2ÿ1
    ÿÿ3.ÿ10ÿ3ÿ1
    ÿÿ4.ÿ11ÿ2ÿ0
    ÿÿ5.ÿ11ÿ2ÿ0
    ÿÿ6.ÿ12ÿ1ÿÿ.
    ÿÿ7.ÿ12ÿ3ÿÿ.
    ÿÿ8.ÿ12ÿ3ÿÿ.
    ÿÿ9.ÿend

    .ÿÿ
    .ÿbysortÿV1:ÿegenÿlongÿtot3ÿ=ÿsum(V2ÿ==ÿ3)ÿ//ÿ<=ÿHere

    .ÿ
    .ÿlistÿV?ÿtot3,ÿnoobsÿsepby(V1)

    ÿÿ+----------------+
    ÿÿ|ÿV1ÿÿÿV2ÿÿÿtot3ÿ|
    ÿÿ|----------------|
    ÿÿ|ÿ10ÿÿÿÿ1ÿÿÿÿÿÿ1ÿ|
    ÿÿ|ÿ10ÿÿÿÿ2ÿÿÿÿÿÿ1ÿ|
    ÿÿ|ÿ10ÿÿÿÿ3ÿÿÿÿÿÿ1ÿ|
    ÿÿ|----------------|
    ÿÿ|ÿ11ÿÿÿÿ2ÿÿÿÿÿÿ0ÿ|
    ÿÿ|ÿ11ÿÿÿÿ2ÿÿÿÿÿÿ0ÿ|
    ÿÿ|----------------|
    ÿÿ|ÿ12ÿÿÿÿ1ÿÿÿÿÿÿ2ÿ|
    ÿÿ|ÿ12ÿÿÿÿ3ÿÿÿÿÿÿ2ÿ|
    ÿÿ|ÿ12ÿÿÿÿ3ÿÿÿÿÿÿ2ÿ|
    ÿÿ+----------------+

    .ÿ
    .ÿexit

    endÿofÿdo-file


    .

    Comment


    • #3
      Maybe this gets you closer to a solution that works for you:

      Code:
      clear
      
      input V1    V2
      10    1    
      10    2    
      10    3    
      11    2    
      11    2    
      end
      
      collapse (count) V2, by(V1)
      
      list

      Comment


      • #4
        Graceh Kim did you try:

        bys V1: egen v2count = count(V2) if V2==3


        bysort V1 V2 (v2count): replace v2count=v2count[_N]
        replace v2count = 0 if v2count==.

        Comment


        • #5
          Joseph Coveney Thank you this worked just as I anticipated.
          Alan Neustadtl Thank you so much for the input!
          Tom Scott This worked similarly as the code Mr.Coveney suggested, but only one row was filled up. Thank you so much for the input!

          Comment


          • #6
            The easiest thing to do here is what Joseph did, and we should note that since Stata 8 or so we are not supposed to call this function sum, but rather total. And this is so because there is a sum function which goes with gen and calculates the running sum.

            Code:
            . egen total = total(v2==3), by(v1)
            Another solution would be without any egen, and with the running sum

            Code:
            . bysort v1: gen runsum = sum(v2==3)
            
            . by v1: replace runsum=runsum[_N]
            (2 real changes made)
            And yet another exotic solution would be using count and division by zero.

            Code:
            . egen count = count(1/(v2==3)), by(v1)
            
            
            . list, sepby(v1)
            
                 +-------------------------------------------+
                 | v1   v2   newvar   total   runsum   count |
                 |-------------------------------------------|
              1. | 10    1        1       1        1       1 |
              2. | 10    2        1       1        1       1 |
              3. | 10    3        1       1        1       1 |
                 |-------------------------------------------|
              4. | 11    2        0       0        0       0 |
              5. | 11    2        0       0        0       0 |
                 +-------------------------------------------+

            Comment


            • #7
              I don't think anyone is trying to forbid or do more than mildly discourage naming sum() as a function that works with egen. The main risk is that you will puzzle anyone who tries to read the documentation, seemingly a group at risk of extinction.

              egen had a documented sum() function until Svend Juul gave a very witty talk in Berlin pointing out, among other things, that egen, sum() gives totals and sum() as a Stata function gives cumulative or running totals, and how is that consistent? Svend's slides are at https://www.stata.com/meeting/2german/Juul.pdf and some of the content from 2004 is still fair comment.

              Irrelevant digression: Svend's talk went down so well in Berlin that StataCorp asked him to visit Texas if he was ever nearby, and give it all over again, and he did. Unfortunately someone high up in the company knocked a jug of water all over him just before his talk. Accidentally. But the company have, surprise, Stata solutions for Stata problems. Someone rushed down to marketing and came back with two new Stata polo shirts for him to change.

              Answer: it's not, so the company took egen, sum() undocumented in Stata 9 and documented a total() function instead, But they didn't want to break any existing code and so sum() continues to work,

              Mata has runningsum() as a name, which is a little long, but explicit. There are all sorts of reasons, some very delicate, why cusum() or some variant on that is probably not a good idea.

              My own idea is that runningsum() should be a synonym for sum() as a Stata function and the latter should go undocumented, but that might not be popular either.

              Similarly, I think we would be better off if date() went undocumented and its existing synonym daily() were publicised more instead. Despite the documentation, too often date() is guessed to be a generic date function that reads the mind of the user and produces the kind of date variable, they would prefer.

              Comment

              Working...
              X