How to tag new observations in a dataset

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1

01 Aug 2018, 06:46

Hi All,

I have a dataset that resembles the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year y x)
1 1990 123  32
2 1990 321  23
3 1990   3  23
4 1990 213  23
1 1991 213 123
2 1991   3 123
3 1991 123 213
4 1991 123 123
5 1991  23  23
end

In the above dataset, I have information on y and x by individual (identified by id) and year. There is a big jump in the number of new individuals in 1991. As can be seen above, the observation with id=5 is new in 1991 (nonexistent in 1990). My results are quite sensitive to the inclusion of these new individuals. Suppose that 1990 is the "base" year, that is, the year up to which the set of individuals was more or less constant. Is there any way to tag new entrants relative to 1990? For instance, I would like to create a dummy variable where individual 5 in 1991 gets a value of 1 (new entrants get 1), whereas individuals who already existed in the previous year get 0.

Best Wishes,
CS
Jesse Tielens

Join Date: Jul 2018

Posts: 46
#2

01 Aug 2018, 07:04

I'm not sure if this is the quickest or best way to do this, but this is how I'd do it:

Code:

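* Number each id's observations in year order
bysort id (year): gen temp = _n
gen present_before_1991 = 0
* An id whose 1991 row is not its first row was already present earlier
replace present_before_1991 = 1 if year == 1991 & temp > 1
* Spread the flag to every row of that id
bysort id: egen max_value = max(present_before_1991)
replace present_before_1991 = 1 if max_value == 1
drop temp max_value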

Let me know if that works for you

EDIT: I just noticed you asked for the opposite coding of the dummy variable. I've given those individuals that were already there before 1991 a '1', while you asked to have them tagged as '0'. Nonetheless, you can of course reverse that if you want.
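
Reversing the coding could look something like the sketch below. It is only an illustration on a cut-down version of the example data: `new_entrant` is a hypothetical variable name, and the `egen` line is a compact stand-in for the steps above rather than the code as posted.

```stata
clear
input float(id year)
1 1990
1 1991
5 1991
end

* Stand-in for the steps above: 1 if the id has any observation before 1991
bysort id : egen present_before_1991 = max(year < 1991)

* Flip the coding so that new entrants get 1 and existing ids get 0
gen byte new_entrant = !present_before_1991

* Added check: only id 5 is a new entrant
assert new_entrant == (id == 5)
```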

Last edited by Jesse Tielens; 01 Aug 2018, 07:06.
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#3

01 Aug 2018, 07:13

Hi Jesse!

Many thanks - this works.
Nick Cox

Join Date: Mar 2014

Posts: 35563
#4

01 Aug 2018, 07:16

Some more technique:

Code:

bysort id (year) : gen is_new = year[1] > 1990
bysort id (year) : gen entered_1991 = year[1] == 1991
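
As a rough check, the two lines above can be run against the example data from post #1. This is only a sketch: the `assert` and `list` lines are additions for illustration and are not part of the original reply.

```stata
clear
input float(id year y x)
1 1990 123  32
2 1990 321  23
3 1990   3  23
4 1990 213  23
1 1991 213 123
2 1991   3 123
3 1991 123 213
4 1991 123 123
5 1991  23  23
end

* Within each id, sort by year so that year[1] is the first year observed
bysort id (year) : gen is_new = year[1] > 1990
bysort id (year) : gen entered_1991 = year[1] == 1991

* Added checks: only id 5 first appears after 1990
assert is_new == (id == 5)
assert entered_1991 == is_new

list id year is_new entered_1991, sepby(id)
```

In this two-year example the two variables coincide. They would differ only if the data included years after 1991, in which case `is_new` flags anyone entering after 1990 and `entered_1991` flags only those entering in 1991.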
Jesse Tielens

Join Date: Jul 2018

Posts: 46
#5

01 Aug 2018, 07:31

Originally posted by Nick Cox

Some more technique:

Code:

bysort id (year) : gen is_new = year[1] > 1990
bysort id (year) : gen entered_1991 = year[1] == 1991

That's definitely a shorter and more elegant solution to Chinmay's problem.

If you don't mind me asking, I've noticed in several of your comments that you use this syntax:

Code:

bysort id (year): .....

With 'year' between parentheses. How is that command different from:

Code:

bysort id year: ...

The manual seems to list your code as the correct one, but the output is identical?
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#6

01 Aug 2018, 07:43

Thanks, Nick Cox!
Nick Cox

Join Date: Mar 2014

Posts: 35563
#7

01 Aug 2018, 07:43

There is a world of difference there. With

Code:

bysort id (year)

the distinct groups are defined by id alone; within those groups, observations are sorted by year. In very many panel problems, that is the kind of thing you want.

With

Code:

bysort id year

the distinct groups are defined by id and year jointly. For many panel datasets with at most one observation for each identifier and time, that could define for each group at most one observation. It wouldn't bite unless you thought it specified a calculation comparing observations in each panel.
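
A small sketch may make the difference concrete. It uses a cut-down version of the example data, and the variable names (`n_in_id`, `first_year`, `n_in_id_year`, `first_year_by_row`) are hypothetical, chosen only for illustration.

```stata
clear
input float(id year)
1 1990
1 1991
5 1991
end

* Groups defined by id alone; rows are sorted by year within each id
bysort id (year) : gen n_in_id = _N
bysort id (year) : gen first_year = year[1]

* Groups defined by id and year jointly; here every group is a single row
bysort id year : gen n_in_id_year = _N
bysort id year : gen first_year_by_row = year[1]

* Added checks: with joint groups, year[1] is just each row's own year
assert n_in_id_year == 1
assert first_year_by_row == year

list, sepby(id)
```

With the joint grouping, a test such as `year[1] > 1990` would flag the 1991 row of id 1 as new, even though id 1 was already present in 1990. The two forms give identical output only when the calculation does not compare observations within a panel.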

See e.g. https://www.stata-journal.com/sjpdf....iclenum=pr0004 for a tutorial on by:.
Jesse Tielens

Join Date: Jul 2018

Posts: 46
#8

01 Aug 2018, 07:48

That's definitely an important distinction; it could prove very useful if I ever have a panel with multiple observations per year. Thanks!

Chinmay Sharma

Join Date: Nov 2015
Posts: 351

01 Aug 2018, 07:54

I tried another route as well:

Code:

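* ifscode appears to be the identifier in the actual data (id in the example)
gen present1990 = 0
replace present1990 = 1 if !missing(y) & year == 1990
gen present1991 = 0
replace present1991 = 1 if !missing(y) & year == 1991
* An id's first row has no previous row, so treat it as absent in 1990
by ifscode (year), sort: gen new = present1991 - cond(_n == 1, 0, present1990[_n-1])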

This works as well, but is definitely not as succinct as Nick's. Thanks Jesse Tielens as well!
