Defining id's of individuals in the same family in a panel dataset

Giorgio Nocerino

Join Date: Jun 2024

Posts: 11
#1

12 Sep 2024, 04:04

Hello!
I'm working on a panel dataset that identifies individuals by the id of the family (variable "nquest") and their order number within the family (variable "nord"). For example, the father has nord equal to 1, the mother 2, and a child 3.
However, when an individual exits the family and another one enters it, the two can be confused: the newcomer takes over the same value of nord as the individual who left, so two different people end up under the same nquest and nord. An additional variable, "nordp", helps to reveal this issue: it gives the individual's order number in the previous round of the survey (even if the individual did not complete the survey in that round). I attach the observations for one family to make the problem clearer.

This extract includes some variables I have not mentioned yet: "anno", the year in which the observation was taken; "eta", the age of the individual; and "id", a variable I created with egen's group(nquest nord) function with the goal of obtaining a unique identifier for each individual.
As you can see, after 2006 the individual with nord==3 leaves the dataset, and in 2008 (the following wave of the panel) the individual who was 4th in the family enters with an order number of 3, because the other individual left. So by looking only at nquest and nord we "merge" two individuals into one (indeed, they have the same id).
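
To make the problem concrete, here is a minimal sketch that reproduces it. The values are those of family 632 as given in dataex form in #2; id is numbered 1 here only because a single family is loaded, and age_drop is a hypothetical helper variable that is not part of the original data.

Code:

```stata
* Family 632, values taken from the dataex example in #2
clear
input float(nquest nord anno eta nordp)
632 3 2002 26 3
632 3 2004 28 3
632 3 2006 30 3
632 3 2008 25 4
632 3 2018 27 3
end

* group() assigns one id per (nquest, nord) pair, whoever holds that slot
egen id = group(nquest nord)

* Hypothetical check: the same person's age should not fall between waves.
* The _n > 1 condition is needed because eta[_n-1] is missing in the first
* row, and any number compares as smaller than missing in Stata.
bysort id (anno): gen byte age_drop = eta < eta[_n-1] & _n > 1

list nquest nord anno eta nordp id age_drop, sepby(id)
```

All five rows share one id, and age_drop marks the 2008 row, where the second individual takes over nord==3.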

My goal is to create a working unique identifier (variable id) that avoids this kind of confusion within the dataset. Does anybody have an idea on how to solve this issue?
If I have not been clear enough, I'm more than happy to clarify.

Thanks in advance!

Andrew Musau

Join Date: Oct 2014
Posts: 10254
#2

12 Sep 2024, 06:47

I gather that "nordp" is a family-level variable, so if the individual who exited had the same age as the individual who entered, there is no way to tell which specific id is affected (as nordp varies for all family members in the year after the change). If no such cases exist, you can use age differences to uniquely identify household members, assuming the survey is conducted exactly every two years. For your future posts, use dataex as recommended in FAQ Advice #12.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(nquest nord anno eta nordp id)
632 3 2002 26 3 1686
632 3 2004 28 3 1686
632 3 2006 30 3 1686
632 3 2008 25 4 1686
632 3 2018 27 3 1686
end


bys id (anno): gen unique= sum(eta-eta[_n-1]!=2)
egen newid= group(id unique), label

Res.:

Code:

. l, sepby(newid)

     +-------------------------------------------------------------+
     | nquest   nord   anno   eta   nordp     id   unique    newid |
     |-------------------------------------------------------------|
  1. |    632      3   2002    26       3   1686        1   1686 1 |
  2. |    632      3   2004    28       3   1686        1   1686 1 |
  3. |    632      3   2006    30       3   1686        1   1686 1 |
     |-------------------------------------------------------------|
  4. |    632      3   2008    25       4   1686        2   1686 2 |
  5. |    632      3   2018    27       3   1686        2   1686 2 |
     +-------------------------------------------------------------+
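
The one-line flag above can be unpacked into separate steps to show why it works. This is only a sketch of the same logic on the same example data; the intermediate variable newperson is a hypothetical name introduced for illustration.

Code:

```stata
clear
input float(nquest nord anno eta nordp id)
632 3 2002 26 3 1686
632 3 2004 28 3 1686
632 3 2006 30 3 1686
632 3 2008 25 4 1686
632 3 2018 27 3 1686
end

* Step 1: flag rows where age did not rise by exactly 2 since the previous row.
* In the first row of each id, eta[_n-1] is missing, so the difference is
* missing, the comparison with 2 is true, and the first individual is flagged.
bysort id (anno): gen byte newperson = (eta - eta[_n-1]) != 2

* Step 2: a running sum of the flags numbers the individuals 1, 2, ... within id
by id: gen unique = sum(newperson)

* Step 3: one identifier per (id, unique) pair
egen newid = group(id unique), label
```

The approach relies entirely on the two-year spacing: any pair of consecutive rows whose ages differ by something other than 2 starts a new individual.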

Last edited by Andrew Musau; 12 Sep 2024, 06:49.

Giorgio Nocerino

Join Date: Jun 2024

Posts: 11
#3

12 Sep 2024, 08:19

I partially solved the issue using this code:

Code:

sort nquest nord anno
gen new_entry = 0
replace new_entry=1 if (nord != nordp) | missing(nordp)
bysort nquest nord: gen missing_prev_obs = (_n == 1 & nord == nordp & new_entry == 0)
egen id_new = group(nquest nord anno) if new_entry == 1 | missing_prev_obs == 1
bysort nquest nord (anno): replace id_new = id_new[_n-1] if missing(id_new)
bysort nquest nord: replace id_new = sum(id_new) if missing(id_new)
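
The same idea can be sketched more compactly as a running count of "spells" within each (nquest, nord) slot. This is only an illustration, run on the family from #2, and it assumes that nordp really records the individual's own order number in the previous wave, as described in #1; the names spell and id_spell are hypothetical.

Code:

```stata
clear
input float(nquest nord anno eta nordp)
632 3 2002 26 3
632 3 2004 28 3
632 3 2006 30 3
632 3 2008 25 4
632 3 2018 27 3
end

* A new spell starts whenever the previous-wave order number differs from the
* current one. A missing nordp also counts, because nord != . is true.
bysort nquest nord (anno): gen spell = sum(nord != nordp)

* One identifier per (nquest, nord, spell) combination
egen id_spell = group(nquest nord spell)

list nquest nord anno eta nordp spell id_spell, sepby(id_spell)
```

On this family the 2002 to 2006 rows form one spell and the rows from 2008 onward another. The sketch inherits the limits of nordp: where nordp equals nord in every row after the first, the slot gets a single identifier even if the ages look inconsistent.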

However, some observations still have issues:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input nquest nord anno eta nordp id_new
855385 4 2010 22 . 9603
855385 4 2012 23 4 9603
861476 3 2012 30 3 9665
861476 3 2010 35 . 9665
end

What do you think? Should I remove such observations?
Andrew Musau

Join Date: Oct 2014

Posts: 10254
#4

12 Sep 2024, 14:39

The surveys do not seem to be evenly spaced from your additional example, so age cannot be used to identify individuals. With my limited understanding of your variables, I don’t see a way to resolve your issue. You might want to consult the data providers for guidance, as there may be another variable you’re overlooking. Again, I don’t believe "nordp" is the key for the reasons I mentioned in #2, but I may simply be misunderstanding your data.
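
Even if age cannot identify individuals here, it can still help to list the ids worth raising with the data providers. The following is only a diagnostic sketch on the observations from #3: it compares the change in age with the time elapsed between waves and flags ids where the two disagree by more than a tolerance. The tolerance of 1 year and the variable names age_gap, suspect and any_suspect are assumptions made for illustration.

Code:

```stata
clear
input nquest nord anno eta nordp id_new
855385 4 2010 22 . 9603
855385 4 2012 23 4 9603
861476 3 2012 30 3 9665
861476 3 2010 35 . 9665
end

* Hypothetical tolerance, in years, for noise in reported age or interview date
local tol 1

* Change in age minus years elapsed since the previous observation of the id
bysort id_new (anno): gen age_gap = (eta - eta[_n-1]) - (anno - anno[_n-1])

* The first row of each id has a missing age_gap and is never flagged
gen byte suspect = abs(age_gap) > `tol' & !missing(age_gap)

* Mark every observation of an id that contains at least one suspect jump
bysort id_new: egen byte any_suspect = max(suspect)

list if any_suspect, sepby(id_new)
```

With this tolerance, family 855385 (age 22 in 2010, 23 in 2012) is not flagged, while family 861476 (age 35 in 2010, 30 in 2012) is. The flag does not say which of the two rows belongs to the newcomer, so it is a way to count and inspect problem cases, not to repair them.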