Possible to select only unique observations (without dropping duplicates)?

Stine Nielsen

Join Date: Jan 2017

Posts: 8
#1

31 Jan 2017, 09:06

I'm working on a dataset with about 45,000 observations for about 5500 unique id's.
Is there a way to do e.g. tab of only the observations with a unique id? I know I can drop duplicates, but I need them later. However for some of my analysis I only want to display the observations that have a unique id.
Any thoughts?
Thank you!
ps I work in Stata 13.1/IC on Mac.
Attaullah Shah

Join Date: Aug 2014

Posts: 1669
#2

31 Jan 2017, 09:33

You have not told us about the purpose or about your data set. An example of your data set, produced with dataex (from SSC), would help us pinpoint what you want to accomplish. Anyway, there are several ways to do what you want.
The first is to preserve your data set, then drop duplicates, then restore.

Alternatively, you can drop duplicates, save a temporary file, and then reload the original file for other uses.
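
A minimal sketch of the preserve/restore route, using toy data. The variable names id and sex are assumptions for illustration, not taken from the poster's data set.

```stata
* Toy data: id and sex are assumed variable names for illustration
clear
input id str1 sex
1 "F"
1 "F"
2 "M"
2 "M"
2 "M"
end

preserve
    * keep one observation per id; -force- is required whenever
    * a varlist is given, as a reminder that information may be lost
    duplicates drop id, force
    tabulate sex
restore

* the full data set, duplicates included, is back in memory
count
```

Which observation survives for each id depends on the current sort order, so this route is only suitable when any one observation per id will do.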

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

31 Jan 2017, 09:45

On using the term distinct rather than unique, see early comments in http://www.stata-journal.com/sjpdf.h...iclenum=dm0042 The whole paper may be useful too.

I think what you want is along these lines

Code:

egen tag = tag(id)
... if tag

which works with just one example (the first occurrence) of each distinct identifier, so that for example

Code:

list id if tag, noobs

shows the distinct identifiers just once each.
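
As a rough illustration with toy data (id and sex are assumed variable names), tagging one observation per distinct identifier and then restricting a command with an if qualifier might look like this:

```stata
* Toy data: three distinct ids, six observations in total
clear
input id str1 sex
1 "F"
1 "F"
2 "M"
3 "F"
3 "F"
3 "F"
end

* tag() marks exactly one observation per distinct id with 1, all others with 0
egen tag = tag(id)

* tabulate using only the tagged observations; nothing is dropped
tabulate sex if tag

* number of distinct ids
count if tag
```

Because tag is always 0 or 1 and never missing, `if tag` and `if tag == 1` select the same observations.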
Stine Nielsen

Join Date: Jan 2017

Posts: 8
#4

31 Jan 2017, 09:50

Thank you. I am an epidemiologist working on infectious diseases. My id var refers to individuals and my observations are e.g. about when people got tested (and many got tested multiple times).
I will try preserve and restore. I was hoping it would be possible with an "if command", something like if id == "unique", but I guess that is not possible.

Kind regards from Madrid, Spain.

Stine Nielsen

Phd student, EPIET alumni and freelance epidemiologist
LinkedIn: www.linkedin.com/in/stinenielsenepi
Twitter: www.twitter.com/StineNielsenEPI
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

31 Jan 2017, 09:53

I think you're replying to Attaullah in #2. It really is possible with an if qualifier (not command).
Stine Nielsen

Join Date: Jan 2017

Posts: 8
#6

31 Jan 2017, 10:18

Thx Nick! Yes I replied to post#2 before reading your reply.

Code:

egen tag = tag(id)
tab sex if tag==1

works for me.
Thanks again.
Stine Nielsen

Join Date: Jan 2017

Posts: 8
#7

02 Feb 2017, 05:13

Is there a way to make sure that I always tag the most recent observation per id?

I tried this, but it didn't work:

Code:

bysort id (datevar) : egen tag2 = tag(id)

Each id has many observations and I want to choose / tag the observation with the most recent date, max(datevar), per id.
Any tips on how to do that?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

02 Feb 2017, 05:22

Code:

bysort id (datevar) : gen byte is_last = _n == _N
Stine Nielsen

Join Date: Jan 2017

Posts: 8
#9

02 Feb 2017, 05:34

Almost works. But I need it to ignore missing values in datevar - I tried adding

Code:

if datevar!=.

at the end of your suggested code, but this didn't work well: for several ids, the observation with the last datevar was not flagged by is_last.
More tips?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#10

02 Feb 2017, 06:05

Code:

gen ismissing = -missing(datevar)
bysort id (ismissing datevar) : gen byte is_last = _n == _N

Note particularly the minus sign.

missing(datevar) is 1 if the argument is missing and 0 otherwise. If we negate that (-1 and 0), then missings are sorted first for each individual.
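
A small sketch with made-up data may help to see why this works. The dates are arbitrary daily date values chosen for illustration. Adding `if datevar != .` to the earlier command fails because an if qualifier does not change _n or _N: numeric missings sort last, so within an id that has a missing date the _N-th observation is the missing one, and no non-missing observation satisfies _n == _N.

```stata
* Toy data: arbitrary daily dates, some missing (illustration only)
clear
input id datevar
1 20000
1 20100
1 .
2 .
2 20050
3 .
end
format datevar %td

* -1 for missing dates, 0 otherwise, so missings sort first within each id
gen ismissing = -missing(datevar)

* the last observation within each id is now the latest non-missing date
bysort id (ismissing datevar) : gen byte is_last = _n == _N

list, sepby(id)
```

Note that an id whose dates are all missing (id 3 here) still has its last observation flagged. If that is unwanted, one option is to use `_n == _N & !missing(datevar)` as the expression instead.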
Stine Nielsen

Join Date: Jan 2017

Posts: 8
#11

02 Feb 2017, 06:22

Fantastic! Thanks so much!