Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count the number of unique observation conditional on variable

    Hi Statlist users,

    I am currently having an issue counting the unique number of observation conditional on another variable.

    In particular, I want to do the following:
    Session_district Facility_id Count
    1001 A 2
    1001 B 3
    1001 A 2
    1002 B 3
    1002 A 2
    1003 B 3
    1003 B 3
    1001 C 1
    1001 C 1
    1001 C 1
    For each facility_id, I want to count the number of observation based on the unique value of session_district.
    Hence, as illustrated in the example, although facility_id will occur distinctly 3 times, because there are only two unique session_district, the count would be 2.


    I tried code:
    egen count = nvals(session_district), by(facility_id) would give me count for distinct observations by session_district, i.e. facility_id A would get count equal to 3, instead of 2, which is what I want.

    Could someone provide some help with this? I could not really find the answer to this elsewhere.
    Thanks!







  • #2
    Welcome to Statalist.
    Code:
    cls
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int session_district str1 facility_id byte count
    1001 "A" 2
    1001 "B" 3
    1001 "A" 2
    1002 "B" 3
    1002 "A" 2
    1003 "B" 3
    1003 "B" 3
    1001 "C" 1
    1001 "C" 1
    1001 "C" 1
    end
    egen tocount = tag(facility_id session_district)
    bysort facility_id (session_district) : egen newcount = total(tocount)
    list, sepby(facility_id) abbreviate(16) noobs\
    Code:
    . list, sepby(facility_id) abbreviate(16) noobs
    
      +-------------------------------------------------------------+
      | session_district   facility_id   count   tocount   newcount |
      |-------------------------------------------------------------|
      |             1001             A       2         0          2 |
      |             1001             A       2         1          2 |
      |             1002             A       2         1          2 |
      |-------------------------------------------------------------|
      |             1001             B       3         1          3 |
      |             1002             B       3         1          3 |
      |             1003             B       3         1          3 |
      |             1003             B       3         0          3 |
      |-------------------------------------------------------------|
      |             1001             C       1         0          1 |
      |             1001             C       1         1          1 |
      |             1001             C       1         0          1 |
      +-------------------------------------------------------------+

    Comment


    • #3
      Your example is very puzzling. You show a data example with a variable Count and also

      Code:
      egen count = nvals(session_district), by(facility_id)
      as supposedly what produced it. In principle the result of the nvals() function from egenmore (from SSC, as you are asked to explain) will always be constant for each distinct value of the by() argument.

      Correcting for inconsistencies in upper and lower case I get this, which seems right to me:

      Code:
      clear
      input Session_district    str1 Facility_id    Count
      1001    A    2
      1001    B    3
      1001    A    2
      1002    B    3
      1002    A    2
      1003    B    3
      1003    B    3
      1001    C    1
      1001    C    1
      1001    C    1
      end
      
      sort Session_district Facility_id
      egen wanted = nvals(Facility_id), by(Session_district)
      
      list, sepby(Session_district)
      
           +--------------------------------------+
           | Sessio~t   Facili~d   Count   wanted |
           |--------------------------------------|
        1. |     1001          A       2        3 |
        2. |     1001          A       2        3 |
        3. |     1001          B       3        3 |
        4. |     1001          C       1        3 |
        5. |     1001          C       1        3 |
        6. |     1001          C       1        3 |
           |--------------------------------------|
        7. |     1002          A       2        2 |
        8. |     1002          B       3        2 |
           |--------------------------------------|
        9. |     1003          B       3        1 |
       10. |     1003          B       3        1 |
           +--------------------------------------+
      Please note from http://www.statalist.org/forums/help#stata key advice for questions here:

      * Show a data example using dataex

      * Show the exact code you used, not a rewriting

      * Explain community-contributed commands you refer to.

      https://www.stata-journal.com/sjpdf....iclenum=dm0042 makes a case that "unique" is never the best word unless you're referring to values that occur precisely once: "distinct" is recommended instead.

      Note: William Lisowski uses in #2 a method also recommended in dm0042, as making use in two lines of "official" egen functions, rather than downloading a community-contributed one. But naturally it is the same answer either way.
      Last edited by Nick Cox; 14 May 2019, 12:26.

      Comment


      • #4
        I was confused in #3 by the sort order in #1, which seemed haphazard. Sorry about that. But if this is what you wanted then it matches your result, so I still don't understand the question, but for a different reason.

        Code:
             +--------------------------------------+
             | Sessio~t   Facili~d   Count   wanted |
             |--------------------------------------|
          1. |     1001          A       2        2 |
          2. |     1001          A       2        2 |
          3. |     1002          A       2        2 |
             |--------------------------------------|
          4. |     1001          B       3        3 |
          5. |     1002          B       3        3 |
          6. |     1003          B       3        3 |
          7. |     1003          B       3        3 |
             |--------------------------------------|
          8. |     1001          C       1        1 |
          9. |     1001          C       1        1 |
         10. |     1001          C       1        1 |
             +--------------------------------------+

        Comment


        • #5
          #4 was incomplete. Here's the whole story:

          Code:
          clear
          input Session_district    str1 Facility_id    Count
          1001    A    2
          1001    B    3
          1001    A    2
          1002    B    3
          1002    A    2
          1003    B    3
          1003    B    3
          1001    C    1
          1001    C    1
          1001    C    1
          end
          
          sort  Facility_id Session_district
          egen wanted = nvals(Session_district), by(Facility_id) 
          
          list, sepby(Facility_id) 
          
               +--------------------------------------+
               | Sessio~t   Facili~d   Count   wanted |
               |--------------------------------------|
            1. |     1001          A       2        2 |
            2. |     1001          A       2        2 |
            3. |     1002          A       2        2 |
               |--------------------------------------|
            4. |     1001          B       3        3 |
            5. |     1002          B       3        3 |
            6. |     1003          B       3        3 |
            7. |     1003          B       3        3 |
               |--------------------------------------|
            8. |     1001          C       1        1 |
            9. |     1001          C       1        1 |
           10. |     1001          C       1        1 |
               +--------------------------------------+

          Comment

          Working...
          X