
  • Counting unique ids

    I am using the following code:

    Code:
    bys month: count(refer_id)
    Is there anything I can add so that it counts only unique refer_ids per month? For instance, if refer_id 1822 shows up twice in a month, it should be counted only once.
    Thanks!

  • #2
    No, that code gives an error message. I think you meant -bys month: egen n_refer_ids = count(refer_id)-. To get the number of distinct values of refer_id you can do
    Code:
    bys month (refer_id): egen n_distinct_refer_ids = total(refer_id != refer_id[_n-1])



    • #3
      Hi Collin. Clyde has given you a method that stores the number of distinct IDs per month as a variable--and if that's what you want, great. But your post suggested that you might just want the output from -count- rather than a new variable. If so, see the example below. HTH.

      Code:
      . * Generate some data to illustrate.
      . clear
      
      . input byte month refer_id
      
              month   refer_id
        1. 1  1001
        2. 1  1001
        3. 1  1002
        4. 1  1002
        5. 2  1001
        6. 2  1002
        7. 2  1002
        8. 2  1003
        9. 3  1001
       10. 3  1002
       11. 3  1003
       12. 3  1003
       13. 4  1001
       14. 4  1002
       15. 4  1003
       16. 4  1004
       17. 4  1005
       18. end
      
      .
      . * Clyde's code generates a new variable
      . bys month (refer_id): egen n_distinct_refer_ids = total(refer_id != refer_id[_
      > n-1])
      
      . list, clean
      
             month   refer_id   n_dist~s  
        1.       1       1001          2  
        2.       1       1001          2  
        3.       1       1002          2  
        4.       1       1002          2  
        5.       2       1001          3  
        6.       2       1002          3  
        7.       2       1002          3  
        8.       2       1003          3  
        9.       3       1001          3  
       10.       3       1002          3  
       11.       3       1003          3  
       12.       3       1003          3  
       13.       4       1001          5  
       14.       4       1002          5  
       15.       4       1003          5  
       16.       4       1004          5  
       17.       4       1005          5  
      
      .
      . * If Collin really wants output from the -count- command,
      . * he can do this:
      . egen to_use = tag(month refer_id)
      
      . bysort month: count if to_use
      
      --------------------------------------------------------------------------------
      -> month = 1
        2
      --------------------------------------------------------------------------------
      -> month = 2
        3
      --------------------------------------------------------------------------------
      -> month = 3
        3
      --------------------------------------------------------------------------------
      -> month = 4
        5

      Here is the code without output:

      Code:
      * Generate some data to illustrate.
      clear
      input byte month refer_id
      1  1001
      1  1001
      1  1002
      1  1002
      2  1001
      2  1002
      2  1002
      2  1003
      3  1001
      3  1002
      3  1003
      3  1003
      4  1001
      4  1002
      4  1003
      4  1004
      4  1005
      end
      
      * Clyde's code generates a new variable
      bys month (refer_id): egen n_distinct_refer_ids = total(refer_id != refer_id[_n-1])
      list, clean
      
      * If Collin really wants output from the -count- command,
      * he can do this:
      egen to_use = tag(month refer_id)
      bysort month: count if to_use
      --
      Bruce Weaver
      Version: Stata/MP 18.5 (Windows)



      • #4
        Thank you both! I appreciate it.



        • #5
          See https://www.stata-journal.com/articl...article=dm0042 for a survey of this territory -- including the detail of why distinct is a much better term than unique.

          @Clyde Schechter's code may work, but note that the help for egen warns against mixing egen calls with subscript referencing. The reason is that egen feels free to change your sort order temporarily for its own purposes -- even though it returns the dataset to its original order when it is done.

          There is an egen function nvals() in egenmore from SSC, but I don't use it any more. Belatedly -- as documented in the paper referenced above -- I realised that although

          Code:
          egen tag = tag(month refer_id) 
          egen distinct = total(tag), by(month)
          is two lines, not one (you noticed), the variable tag is often useful anyway.

          The overlap with the solutions from Clyde and from Bruce Weaver is natural here.
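
          As a minimal sketch of the two-line approach, the following applies it to the example data from #3. The variable names tag and distinct follow the code above; the tabulate step is an added illustration of one way the tag variable can be reused, not part of the original suggestion.

          ```stata
          * Example data copied from #3
          clear
          input byte month int refer_id
          1  1001
          1  1001
          1  1002
          1  1002
          2  1001
          2  1002
          2  1002
          2  1003
          3  1001
          3  1002
          3  1003
          3  1003
          4  1001
          4  1002
          4  1003
          4  1004
          4  1005
          end

          * Flag one observation per distinct (month, refer_id) pair
          egen tag = tag(month refer_id)

          * Number of distinct refer_ids in each month, stored on every observation
          egen distinct = total(tag), by(month)

          * Reuse the tag: each tagged row is one distinct id,
          * so the frequencies are the distinct counts per month
          tabulate month if tag
          ```

          Because tag is 1 for exactly one observation per distinct pair of nonmissing values, any command restricted with if tag sees each pair once.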



          • #6
            Nick Cox makes a good point about the potential problems of combining -egen- with subscripting. The following code does the same thing more safely:

            Code:
            by month (refer_id), sort: gen n_distinct_refer_ids = sum(refer_id != refer_id[_n-1])
            by month (refer_id): replace n_distinct_refer_ids = n_distinct_refer_ids[_N]
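
            To see why these two lines work, here is a sketch on a small made-up dataset (the values are hypothetical and deliberately unsorted). Within each month, sorted by refer_id, the comparison with the previous observation is 1 exactly where a new id starts; sum() accumulates those 1s and the last value in the group, [_N], is the distinct count. The cross-check against the tag()-based approach from #5 is an added illustration.

            ```stata
            * Hypothetical data, deliberately not sorted
            clear
            input byte month int refer_id
            1  1001
            1  1002
            1  1001
            2  1003
            2  1003
            2  1001
            2  1002
            end

            * Running count of "new id" events within month, in refer_id order.
            * Under by:, refer_id[_n-1] is missing for the first observation
            * of each month, so the first id always counts as new.
            by month (refer_id), sort: gen n_distinct_refer_ids = sum(refer_id != refer_id[_n-1])

            * Spread the final running total to every observation in the month
            by month (refer_id): replace n_distinct_refer_ids = n_distinct_refer_ids[_N]

            * Cross-check against the tag()-based approach
            egen tag = tag(month refer_id)
            egen distinct = total(tag), by(month)
            assert n_distinct_refer_ids == distinct

            list month refer_id n_distinct_refer_ids, sepby(month)
            ```

            One difference worth checking on real data: as far as I can tell, this approach counts a missing refer_id as one more distinct value, whereas tag() ignores missing values unless its missing option is specified.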



            • #7
              Note that egen functions are often doing, internally, things like Clyde Schechter's code in #6. In fact, such direct code will usually be much faster. (egen code tends to be more general and to add stuff like support for if and in and treatment of missing values.)

              There is always a downside in computing. Writing stuff like #6 requires fluency with all of changing sort order, the by: prefix, subscripting and Stata's own functions. But that's the sort of fluency that anyone who is using Stata almost every day tends to acquire after some weeks. Or more slowly, as the case may be.
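
              For readers still building that fluency, here is a small sketch of the pieces involved, on made-up data. The variable names obs_in_month, n_in_month and prev_id are hypothetical and used only for illustration.

              ```stata
              * Hypothetical data
              clear
              input byte month int refer_id
              1  1001
              1  1001
              1  1002
              2  1001
              2  1003
              end

              * bysort sorts by month, then by refer_id within month;
              * under by:, _n is the observation number within the month
              bysort month (refer_id): gen obs_in_month = _n

              * _N is the number of observations in the month
              by month: gen n_in_month = _N

              * Subscripts are also interpreted within the month:
              * refer_id[_n-1] is missing for the first observation of each month
              by month: gen prev_id = refer_id[_n-1]

              list, sepby(month)
              ```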

              The egen route isn't painless either. You can waste a lot of time looking for the right function, even if it exists, and the answer may be simple once you see it but not otherwise (e.g. a need to use two egen functions and some extra stuff).
