Creating constant identifier variable across change in id number

Sebastien Willis

Join Date: Mar 2017

Posts: 26
#1

Creating constant identifier variable across change in id number

06 Nov 2019, 06:52

I have a panel of firms, and sometimes the id number of an establishment changes for administrative reasons. The subsequent id number is known, so the structure of the data is as follows, where id_next identifies the new value of a firm's id number in the subsequent period:

Code:

input id year id_next 1 1 . 1 2 2 2 3 . 2 4 3 3 5 . 3 6 . 4 1 . 4 2 .

I'm after a way to generate a unique firm id number across changes in id number. In my example, firms 1-3 would be assigned the unique id 3, firm 4 the unique id 4. It feels like it shouldn't be too hard to do this iteratively, but I can't come up with a simple way to do it. Any suggestions are welcome. I am using Stata 14.
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10188

06 Nov 2019, 08:18

This can be complicated in a large data set, especially if the identifiers cannot be grouped in ascending order. Since this is a panel data set, time is repeated and any form of sorting may disrupt the identification of firms. So I suppose that you have an ordering variable that tells you despite changes in identifiers, a particular sequence of time belongs to the same firm. I will call that variable "order" below.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year id_next)
1 1 .
1 2 2
2 3 .
2 4 3
3 5 .
3 6 .
4 1 .
4 2 .
end

gen order= _n
sort order year
gen new_id= id if id!=id[_n+1] & year[_n+1]<=year
replace new_id= id in l
*FILL MISSING VALUES FROM DOWN TO UP. I USE A LOOP - THE VALUE DEPENDS ON THE MAX LENGTH
forval i=1/10{
     replace new_id=new_id[_n+1] if missing(new_id)
 }

Res.:

Code:

. l, sepby(new_id)

     +--------------------------------------+
     | id   year   id_next   order   new_id |
     |--------------------------------------|
  1. |  1      1         .       1        3 |
  2. |  1      2         2       2        3 |
  3. |  2      3         .       3        3 |
  4. |  2      4         3       4        3 |
  5. |  3      5         .       5        3 |
  6. |  3      6         .       6        3 |
     |--------------------------------------|
  7. |  4      1         .       7        4 |
  8. |  4      2         .       8        4 |
     +--------------------------------------+

ADDED IN EDIT: Best method is to use group_twoway from SSC. The order does not matter here, but you need distinct values of id_next where id_next is missing.

Code:

ssc install group_twoway

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year id_next)
1 1 .
1 2 2
2 3 .
2 4 3
3 5 .
3 6 .
4 1 .
4 2 .
end
*FILL IN DISTINCT VALUES FOR ID_NEXT IF MISSING (MUST NOT BE PRESENT IN ID)
replace id_next= -50000-_n if missing(id_next)
group_twoway id id_next, gen(newid)
bys newid (year): gen wanted= id[_N]

Res.:

Code:

. l, sepby(wanted)

     +--------------------------------------+
     | id   year   id_next   newid   wanted |
     |--------------------------------------|
  1. |  1      1    -50001       1        3 |
  2. |  1      2         2       1        3 |
  3. |  2      3    -50003       1        3 |
  4. |  2      4         3       1        3 |
  5. |  3      5    -50005       1        3 |
  6. |  3      6    -50006       1        3 |
     |--------------------------------------|
  7. |  4      1    -50007       2        4 |
  8. |  4      2    -50008       2        4 |
     +--------------------------------------+

Last edited by Andrew Musau; 06 Nov 2019, 08:57.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

06 Nov 2019, 09:49

You can also get there using group_id (from SSC).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year id_next)
1 1 .
1 2 2
2 3 .
2 4 3
3 5 .
3 6 .
4 1 .
4 2 .
end

* convert pair of identifiers to a long layout
gen long obsid = _n
expand 2 if !mi(id_next)
bysort obsid: gen long idpair = cond(_n == 1, id, id_next)

* group identifiers if they have matching id_next values
clonevar uid = id
group_id uid, matchby(idpair)

* revert to the original observations
bysort obsid: keep if _n == 1

Note that group_id was my first contribution to SSC and was developed to handle a very large problem. I repeatedly tweaked the algorithm to handle the task as efficiently as possible. Here's a quick and dirty example that shows that is it significantly faster than group_twoway:

Code:

clear all
set seed 3123
set obs 100000
gen long obsid = _n
gen long id = runiformint(1,100000)
gen long id_next = runiformint(1,100000) if runiform() < .2

timer on 1
* convert pair of identifiers to a long layout
expand 2 if !mi(id_next)
bysort obsid: gen long idpair = cond(_n == 1, id, id_next)

* group identifiers if they have matching id_next values
clonevar uid = id
group_id uid, matchby(idpair)

* revert to the original observations
bysort obsid: keep if _n == 1
timer off 1

timer on 2
replace id_next= -50000-_n if missing(id_next)
group_twoway id id_next, gen(newid)

* show that new identifiers group the same observations together
bysort newid (uid): assert uid[1] == uid[_N]
bysort uid (newid): assert newid[1] == newid[_N]
timer off 2

timer list

and the timing results on my computer:

Code:

. 
. timer list
   1:      1.03 /        1 =       1.0310
   2:      5.33 /        1 =       5.3350

.

Comment

Sebastien Willis

Join Date: Mar 2017

Posts: 26
#4

07 Nov 2019, 03:19

Thanks to both of you for your quick replies.

Re. Andrew Musau's first suggestion, I unfortunately cannot make the following assumption in my dataset, since the variable -id- is randomly generated, its ordering has no intrinsic meaning:

So I suppose that you have an ordering variable that tells you despite changes in identifiers, a particular sequence of time belongs to the same firm.

Both the -group_twoway- and -group_id- solutions work with my data. However, my dataset is large, and on my test sample (~100k observations) the -group_id- solution is an order of magnitude faster:

Code:

. timer list 1: 10.57 / 1 = 10.5720 2: 0.70 / 1 = 0.6980

So -group_id- wins the day.
Comment

Announcement