  • Defining id's of individuals in the same family in a panel dataset

    Hello!
    I'm working with a panel dataset that identifies individuals by the family ID (variable "nquest") and their order number within the family (variable "nord"). For example, the father has nord equal to 1, the mother 2, and a child 3.
    However, when one individual leaves a family and another enters it, the two can be confused: the newcomer gets the same value of nord as the individual who left, so two different individuals end up under the same nquest and nord. The dataset has an additional variable that helps to highlight this issue (variable "nordp"): it gives the individual's order number in the previous round of the survey (even if the individual did not complete the survey in that round). I attach the observations for one family to make the problem clearer.
    [Attached image: data problem .png]


    In this excerpt I included some variables that I have not yet mentioned: "anno", the year in which the observation was taken; "eta", the age of the individual; and "id", a variable I created with egen's group(nquest nord) function, with the goal of creating a unique identifier for each individual.
    As you can see, after 2006 the individual with nord==3 leaves the dataset, and in 2008 (the following wave of the panel) the individual who was 4th in the family enters with an order number of 3, because the other individual left. So, looking only at nquest and nord, we "merge" two individuals into one (indeed, they have the same id).

    My goal is to create a working unique identifier (variable id) that avoids this kind of confusion. Does anybody have an idea how to solve this issue?
    If I have not been clear enough, I'm more than happy to clarify.

    Thanks in advance!

  • #2
    I gather that "nordp" is a family-level variable, so if the individual who exited had the same age as the individual who entered, there is no way to tell which specific id is affected (as nordp varies for all family members in the year after the change). If no such cases exist, you can use age differences to uniquely identify household members, assuming the survey is conducted exactly every two years. For your future posts, use dataex as recommended in FAQ Advice #12.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(nquest nord anno eta nordp id)
    632 3 2002 26 3 1686
    632 3 2004 28 3 1686
    632 3 2006 30 3 1686
    632 3 2008 25 4 1686
    632 3 2018 27 3 1686
    end
    
    
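    * Within each id, increase the counter whenever age does not rise by exactly 2 between consecutive waves
    bys id (anno): gen unique= sum(eta-eta[_n-1]!=2)
    * Combine id and the counter into a new person identifier
    egen newid= group(id unique), label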
    Res.:


    Code:
    . l, sepby(newid)
    
         +-------------------------------------------------------------+
         | nquest   nord   anno   eta   nordp     id   unique    newid |
         |-------------------------------------------------------------|
      1. |    632      3   2002    26       3   1686        1   1686 1 |
      2. |    632      3   2004    28       3   1686        1   1686 1 |
      3. |    632      3   2006    30       3   1686        1   1686 1 |
         |-------------------------------------------------------------|
      4. |    632      3   2008    25       4   1686        2   1686 2 |
      5. |    632      3   2018    27       3   1686        2   1686 2 |
         +-------------------------------------------------------------+
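
To see which original ids appear to hide more than one person, one could take the largest value of the counter within each id. This is only a minimal sketch on the example above: n_persons is a hypothetical variable name, and the approach still assumes that ages rise by exactly 2 between consecutive waves.

```stata
* Example data from this post
clear
input float(nquest nord anno eta nordp id)
632 3 2002 26 3 1686
632 3 2004 28 3 1686
632 3 2006 30 3 1686
632 3 2008 25 4 1686
632 3 2018 27 3 1686
end

* Counter increases whenever age does not rise by exactly 2 between waves
bysort id (anno): gen unique = sum(eta - eta[_n-1] != 2)

* Largest counter value per id = number of distinct persons detected under that id
* (n_persons is a hypothetical name, not a survey variable)
bysort id: egen n_persons = max(unique)

* List the ids that appear to combine several persons
list nquest nord anno eta id unique n_persons if n_persons > 1, sepby(id)
```

An id with n_persons equal to 1 would be left unchanged by the newid construction, so this listing shows only the cases that need attention.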
    Last edited by Andrew Musau; 12 Sep 2024, 06:49.



    • #3
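      I partially solved the issue using this code:
      Code:
      sort nquest nord anno
      * Flag a new entry when the current order number differs from the previous-round one, or nordp is missing
      gen new_entry = 0
      replace new_entry=1 if (nord != nordp) | missing(nordp)
      * Flag the first observation of each (nquest, nord) group that is not already a new entry
      bysort nquest nord (anno): gen missing_prev_obs = (_n == 1 & nord == nordp & new_entry == 0)
      * Start a new id at every flagged observation
      egen id_new = group(nquest nord anno) if new_entry == 1 | missing_prev_obs == 1
      * Carry the id forward to the following unflagged observations
      bysort nquest nord (anno): replace id_new = id_new[_n-1] if missing(id_new)
      * Fallback for any id_new that is still missing
      bysort nquest nord: replace id_new = sum(id_new) if missing(id_new)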

      However, some observations still have issues:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input nquest nord anno eta nordp id_new
      855385 4 2010 22 . 9603
      855385 4 2012 23 4 9603
      861476 3 2012 30 3 9665
      861476 3 2010 35 . 9665
      end
      What do you think? Should I drop such observations?



      • #4
        Judging from your additional example, the surveys do not seem to be evenly spaced, so age cannot be used to identify individuals. With my limited understanding of your variables, I don’t see a way to resolve your issue. You might want to consult the data providers for guidance, as there may be another variable you’re overlooking. Again, I don’t believe "nordp" is the key for the reasons I mentioned in #2, but I may simply be misunderstanding your data.
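
One way to gauge how far recorded ages drift from the time elapsed between waves is to tabulate the difference between the two within each id. This is only a diagnostic sketch on the two families shown in #3: gap is a hypothetical variable name, and how large a gap still counts as "the same person" is a judgment call that the data providers may be able to inform.

```stata
* Example data from #3
clear
input nquest nord anno eta nordp id_new
855385 4 2010 22 . 9603
855385 4 2012 23 4 9603
861476 3 2012 30 3 9665
861476 3 2010 35 . 9665
end

* Change in age minus change in survey year between consecutive waves of the same id
* (gap is a hypothetical name; it is missing for the first wave of each id)
bysort id_new (anno): gen gap = (eta - eta[_n-1]) - (anno - anno[_n-1])

* Distribution of the gaps: values near 0 are consistent with one person,
* large negative or positive values suggest two persons under one id
tabulate gap, missing
list nquest nord anno eta nordp id_new gap, sepby(id_new)
```

On these four observations the gap is -1 for family 855385 (age rises by 1 over two years) and -7 for family 861476 (age falls by 5 over two years), so only the second case looks like two different people under one id.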
