  • How to find median number of observations per group

    Hello,

    I would like some advice on how to get a summary statistic from a dataset that is a bit complicated. In this dataset, each row is a medical record. A column called id holds the person's id number, but the same id can appear more than once, meaning the same person may have several medical records. I need to find the median number of records per person. (To be clear, I need a single number: the median count of records that a person in this dataset has.)

    I have tried using egen to create groups based on id as well as the collapse command but so far have not been able to figure this out. I would appreciate any help!

    Thank you,
    Elaine

  • #2
    Something like this:
    Code:
    by person_id, sort: gen long num_records = _N
    summ num_records, detail
    local "Median number of records per person = " %2.1f  =`r(p50)'
    Note: Because no example data was provided, this code is untested and may contain typos or other errors. In the future, when asking for help with code, always provide example data. While it is sometimes possible to guess the structure and nature of the data without an example, when those guesses are wrong, those who help you waste their time writing code that cannot possibly work, and you waste yours attempting to run it and then re-posting about that problem. So always show example data when asking for help with code.

    The helpful way to show example data is by using the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.




    • #3
      Consider this toy dataset:

      Code:
           +----+
           | id |
           |----|
        1. |  1 |
           |----|
        2. |  2 |
        3. |  2 |
           |----|
        4. |  3 |
        5. |  3 |
        6. |  3 |
        7. |  3 |
           +----+
      The number of identifiers is 3 and the median number of records is 2 (from 1, 2, 4). But @Clyde Schechter's code will give 4, because each number of records is counted that many times: the seven rows carry the values 1, 2, 2, 4, 4, 4, 4. I take it that you want the median across people, not records. If so, then @Clyde's code needs a tweak:


      Code:
       
       by person_id, sort: gen long num_records = _N
       egen tag = tag(person_id)
       summ num_records if tag, detail
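
      The two kinds of median can be compared on the toy dataset with a minimal sketch like the one below. It assumes the identifier is named id, as in the original question, rather than the guessed name person_id; the display labels are purely illustrative.

      ```stata
      * Recreate the toy dataset: 1 record for id 1, 2 for id 2, 4 for id 3
      clear
      input byte id
      1
      2
      2
      3
      3
      3
      3
      end

      * Record count per person, repeated on every row for that person
      bysort id: gen long num_records = _N

      * Median across records: each person counts once per record
      summarize num_records, detail
      display "Median across records = " %2.1f r(p50)

      * Median across people: tag one row per person and use only those rows
      egen byte tag = tag(id)
      summarize num_records if tag, detail
      display "Median across people = " %2.1f r(p50)
      ```

      Running both summaries side by side makes it easy to see which one answers the question being asked.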



      • #4
        Thanks for your help; this worked.

        Apologies for not including sample data; I will be sure to do so next time.



        • #5
          Yes, Nick is right. Actually, the way I usually handle this is:
          Code:
          by person_id, sort: gen long num_records = _N if _n == 1
          summ num_records, detail
          local "Median number of records per person = " %2.1f =`r(p50)'
          That avoids creating an extra variable that might not otherwise be needed.
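
          Another route that avoids adding any variable to the original data is to reduce the data temporarily to one row per person with -contract-. This is only a sketch under assumptions: the identifier is taken to be id as in the original question, the toy data are re-entered so the block runs on its own, and num_records is just an illustrative name for the frequency variable.

          ```stata
          * Toy data: 1 record for id 1, 2 for id 2, 4 for id 3
          clear
          input byte id
          1
          2
          2
          3
          3
          3
          3
          end

          preserve
          * One row per person, with the record count stored in num_records
          contract id, freq(num_records)
          summarize num_records, detail
          display "Median number of records per person = " %2.1f r(p50)
          restore
          ```

          Because of -preserve- and -restore-, the full record-level dataset is back in memory afterwards.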



          • #6
            local should be display in #2 and #5.



            • #7
              Yes, right again, Nick. Not sure what my brain was doing when I wrote those!?!
