Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Possible to select only unique observations (without dropping duplicates)?

    I'm working on a dataset with about 45,000 observations for about 5500 unique id's.
    Is there a way to do e.g. tab of only the observations with a unique id? I know I can drop duplicates, but I need them later. However for some of my analysis I only want to display the observations that have a unique id.
    Any thoughts?
    Thank you!
    ps I work in Stata 13.1/IC on Mac.

  • #2
    You have not told us about the purpose or about your data set. An example of data set using dataex from SSC will help us in pinpointing what you want to accomplish. Anyway, there are several alternatives of what you want to dol;
    • First one is to preserve your data set,
    • then drop duplicates
    • then restore.
    lternatively, you can drop duplicates, save a temporary file, and then reload the original file for other uses.
    Regards
    --------------------------------------------------
    Attaullah Shah, PhD.
    Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
    FinTechProfessor.com
    https://asdocx.com
    Check out my asdoc program, which sends outputs to MS Word.
    For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

    Comment


    • #3
      On using the term distinct rather than unique, see early comments in http://www.stata-journal.com/sjpdf.h...iclenum=dm0042 The whole paper may be useful too.

      I think what you want is along these lines


      Code:
      egen tag = tag(id) 
      ... if tag 
      which works with just one example (the first occurrence) of each distinct identifier, so that for example

      Code:
      list id if tag, noobs
      shows the distinct identifiers just once each.


      Comment


      • #4
        Thank you. I am an epidemiologist working on infectious diseases. My id var refers to individuals and my observations are e.g. about when people got tested (and many got testet multiple times).
        I will try with preserve and restore. Was hoping it would be possible with an "if command" something like if id == "unique" - but I guess that is not possible.

        Kind regards from Madrid, Spain.

        Stine Nielsen

        Phd student, EPIET alumni and freelance epidemiologist
        LinkedIn: www.linkedin.com/in/stinenielsenepi
        Twitter: www.twitter.com/StineNielsenEPI


        Comment


        • #5
          I think you're replying to Attaullah in #2. It really is possible with an if qualifier (not command).

          Comment


          • #6
            Thx Nick! Yes I replied to post#2 before reading your reply.
            Code:
            egen tag = tag(id) 
            tab sex if tag==1
            works for me.
            Thanks again.

            Comment


            • #7
              Is there a way to make sure that I always tag the most recent observation per id?

              I tried this, but it didn't work:
              Code:
              bysort id (datevar) : egen tag2 = tag(id)
              Each id has many observations and I want to choose / tag the observation with the most recent date
              Code:
              max(datevar)
              per id.
              Any tips on how to do that?

              Comment


              • #8
                Code:
                 
                 bysort id (datevar) : gen byte is_last = _n == _N

                Comment


                • #9
                  Almost works. But I need it to ignore missing values in datevar - I tried adding
                  Code:
                  if datevar!=.
                  at the end of your suggested code - but this didn't work well (gave me several id's where the observation with the last datevar was not included in "is_last".
                  More tips?

                  Comment


                  • #10


                    Code:
                    gen ismissing = -missing(datevar) 
                    bysort id (ismissing datevar) : gen byte is_last = _n == _N
                    Note particularly the minus sign.

                    missing(datevar) is 1 if the argument is missing and 0 otherwise. If we negate that (-1 and 0), then missings are sorted first for each individual.

                    Comment


                    • #11
                      Fantastic! Thanks so much!

                      Comment

                      Working...
                      X