Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating constant identifier variable across change in id number

    I have a panel of firms, and sometimes the id number of an establishment changes for administrative reasons. The subsequent id number is known, so the structure of the data is as follows, where id_next identifies the new value of a firm's id number in the subsequent period:
    Code:
    input id year id_next
    1 1    .
    1 2 2
    2 3 .
    2 4 3
    3 5 .
    3 6 .
    4 1 .
    4 2 .
    I'm after a way to generate a unique firm id number across changes in id number. In my example, firms 1-3 would be assigned the unique id 3, firm 4 the unique id 4. It feels like it shouldn't be too hard to do this iteratively, but I can't come up with a simple way to do it. Any suggestions are welcome. I am using Stata 14.


  • #2
    This can be complicated in a large data set, especially if the identifiers cannot be grouped in ascending order. Since this is a panel data set, time is repeated and any form of sorting may disrupt the identification of firms. So I suppose that you have an ordering variable that tells you despite changes in identifiers, a particular sequence of time belongs to the same firm. I will call that variable "order" below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id year id_next)
    1 1 .
    1 2 2
    2 3 .
    2 4 3
    3 5 .
    3 6 .
    4 1 .
    4 2 .
    end
    
    gen order= _n
    sort order year
    gen new_id= id if id!=id[_n+1] & year[_n+1]<=year
    replace new_id= id in l
    *FILL MISSING VALUES FROM DOWN TO UP. I USE A LOOP - THE VALUE DEPENDS ON THE MAX LENGTH
    forval i=1/10{
         replace new_id=new_id[_n+1] if missing(new_id)
     }

    Res.:

    Code:
    . l, sepby(new_id)
    
         +--------------------------------------+
         | id   year   id_next   order   new_id |
         |--------------------------------------|
      1. |  1      1         .       1        3 |
      2. |  1      2         2       2        3 |
      3. |  2      3         .       3        3 |
      4. |  2      4         3       4        3 |
      5. |  3      5         .       5        3 |
      6. |  3      6         .       6        3 |
         |--------------------------------------|
      7. |  4      1         .       7        4 |
      8. |  4      2         .       8        4 |
         +--------------------------------------+
    ADDED IN EDIT: Best method is to use group_twoway from SSC. The order does not matter here, but you need distinct values of id_next where id_next is missing.

    Code:
    ssc install group_twoway
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id year id_next)
    1 1 .
    1 2 2
    2 3 .
    2 4 3
    3 5 .
    3 6 .
    4 1 .
    4 2 .
    end
    *FILL IN DISTINCT VALUES FOR ID_NEXT IF MISSING (MUST NOT BE PRESENT IN ID)
    replace id_next= -50000-_n if missing(id_next)
    group_twoway id id_next, gen(newid)
    bys newid (year): gen wanted= id[_N]

    Res.:

    Code:
    . l, sepby(wanted)
    
         +--------------------------------------+
         | id   year   id_next   newid   wanted |
         |--------------------------------------|
      1. |  1      1    -50001       1        3 |
      2. |  1      2         2       1        3 |
      3. |  2      3    -50003       1        3 |
      4. |  2      4         3       1        3 |
      5. |  3      5    -50005       1        3 |
      6. |  3      6    -50006       1        3 |
         |--------------------------------------|
      7. |  4      1    -50007       2        4 |
      8. |  4      2    -50008       2        4 |
         +--------------------------------------+
    Last edited by Andrew Musau; 06 Nov 2019, 08:57.

    Comment


    • #3
      You can also get there using group_id (from SSC).

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(id year id_next)
      1 1 .
      1 2 2
      2 3 .
      2 4 3
      3 5 .
      3 6 .
      4 1 .
      4 2 .
      end
      
      * convert pair of identifiers to a long layout
      gen long obsid = _n
      expand 2 if !mi(id_next)
      bysort obsid: gen long idpair = cond(_n == 1, id, id_next)
      
      * group identifiers if they have matching id_next values
      clonevar uid = id
      group_id uid, matchby(idpair)
      
      * revert to the original observations
      bysort obsid: keep if _n == 1
      Note that group_id was my first contribution to SSC and was developed to handle a very large problem. I repeatedly tweaked the algorithm to handle the task as efficiently as possible. Here's a quick and dirty example that shows that is it significantly faster than group_twoway:
      Code:
      clear all
      set seed 3123
      set obs 100000
      gen long obsid = _n
      gen long id = runiformint(1,100000)
      gen long id_next = runiformint(1,100000) if runiform() < .2
      
      timer on 1
      * convert pair of identifiers to a long layout
      expand 2 if !mi(id_next)
      bysort obsid: gen long idpair = cond(_n == 1, id, id_next)
      
      * group identifiers if they have matching id_next values
      clonevar uid = id
      group_id uid, matchby(idpair)
      
      * revert to the original observations
      bysort obsid: keep if _n == 1
      timer off 1
      
      timer on 2
      replace id_next= -50000-_n if missing(id_next)
      group_twoway id id_next, gen(newid)
      
      * show that new identifiers group the same observations together
      bysort newid (uid): assert uid[1] == uid[_N]
      bysort uid (newid): assert newid[1] == newid[_N]
      timer off 2
      
      timer list
      and the timing results on my computer:
      Code:
      . 
      . timer list
         1:      1.03 /        1 =       1.0310
         2:      5.33 /        1 =       5.3350
      
      .

      Comment


      • #4
        Thanks to both of you for your quick replies.

        Re. Andrew Musau's first suggestion, I unfortunately cannot make the following assumption in my dataset, since the variable -id- is randomly generated, its ordering has no intrinsic meaning:

        So I suppose that you have an ordering variable that tells you despite changes in identifiers, a particular sequence of time belongs to the same firm.
        Both the -group_twoway- and -group_id- solutions work with my data. However, my dataset is large, and on my test sample (~100k observations) the -group_id- solution is an order of magnitude faster:
        Code:
        . timer list
           1:     10.57 /        1 =      10.5720
           2:      0.70 /        1 =       0.6980
        So -group_id- wins the day.

        Comment

        Working...
        X