Identifying Observations With Unique IDs

Kristin Schumacher

Join Date: Oct 2016

Posts: 6
#1

Identifying Observations With Unique IDs

14 Oct 2016, 11:31

Hello, I'm working with IPUM-CPS data and am trying to differentiate between different family units within a household. I have several variables to work with including a household serial ID, a person number within the household, a family type variable that states whether the family is a primary family, secondary related family, or a secondary unrelated family, as well as a family relationship variable that identifies the primary references person, spouse, and child in these secondary families. I also have pointer variables that ID the location of the mother and father of the individual in the household (if any).

I can uniquely identify families within an household using the concat command and combining the household serial and the family type variable - as long as there aren't more than one secondary related families or more than one secondary unrelated families. If there is more than one, the concat command I just described would create a unique ID where they are the same family (see data snapshot below).

Here is a snapshot of the variables I've described in two households. I need to be able to differentiate the different related subfamilies in these households by creating a new unique identifier within each household for these subfamilies. As you can see, persons #3-5 are a family and 6-7 are a family. In the next household, 5-7 are a family and 8-11 are a family. I think there may be a way to do identify these subfamilies using the variables indicating the mothers' and fathers' location in the household, or perhaps the other relationship variables, but I can seem to puzzle this one out. Does anyone have any data management tricks that could group these observations together?

Thanks so much!

Last edited by Kristin Schumacher; 14 Oct 2016, 11:46.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

14 Oct 2016, 11:41

Kristin,

I can't speak for others, but I'm going cross-eyed trying to read your sample data! The FAQs explain clearly how to post example data in the most effective way: with the -dataex- command, that you can get from SSC. They also make it clear that Stata output and code should always be posted within code-delimiters so that things line up in a readable way.

It is unlikely anybody can answer this question without testing some ideas out on your example data--but what you have posted is a jumble that can neither be read easily by human eyes nor imported cleanly into Stata. So it's useless.

Please repost your example data following the guidance in FAQ #12, and perhaps somebody will be able to assist you.
2 likes
Comment

Kristin Schumacher

Join Date: Oct 2016
Posts: 6

14 Oct 2016, 12:11

Thanks for the suggestion, Clyde! What a handy tool. The labels are included for the famrel and ftype variables. The "momloc" and "poploc" variables do not have labels. They simply provide the person number (pernum) for the mother and father in the household (if any). Serialfam is the unique ID mentioned above. Thanks, again!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long serialfam byte(pernum age famrel momloc poploc ftype)
2014948703  4 27 1 8  9 3
2014948703  5  2 3 4  0 3
2014948703  6 23 1 8  9 3
2014948703  7  4 3 4  0 3
2014948703  8 55 1 0  0 3
2014948703  9 61 2 0  0 3
2014951203  6 42 1 2  1 3
2014951203  7 43 2 0  0 3
2014951203  8 12 3 6  7 3
2014951203  9  6 3 6  7 3
2014951203 10  1 3 6  7 3
2014951203 11 42 1 2  1 3
2014951203 12 14 3 6  7 3
2014951203 13 10 3 6  7 3
2014951473  3 26 1 1  2 3
2014951473  4 27 2 0  0 3
2014951473  5  5 3 3  4 3
2014951473  6  0 3 3  4 3
2014951473  7 21 1 1  2 3
2014951473  8  1 3 3  4 3
2014956133  3 39 1 0  0 3
2014956133  4 13 3 6  0 3
2014956133  5  6 3 6  0 3
2014956133  6 39 1 1 10 3
2014956133  8 35 1 1 10 3
2014958463  3 20 1 5  6 3
2014958463  4  3 3 5  6 3
2014958463  5 41 1 1  7 3
2014958463  6 53 2 0  0 3
2014964143  2 42 1 1  0 3
2014964143  3  8 3 4  0 3
2014964143  4 38 1 1  0 3
2014964143  5 13 3 2  0 3
2014964143  6 12 3 2  0 3
2014964143  7 40 1 1  0 3
2014964143  8 16 3 0  0 3
2014964143  9 12 3 7  0 3
2015889343  3 42 1 2  1 3
2015889343  4 17 3 8  9 3
2015889343  5 16 3 8  9 3
2015889343  6 13 3 8  9 3
2015889343  7 10 3 8  9 3
2015889343  8 50 1 0  0 3
2015889343  9 54 2 2  1 3
2015895403  3 30 1 2  1 3
2015895403  4 29 2 0  0 3
2015895403  5  2 3 8  7 3
2015895403  6  1 3 4  3 3
2015895403  7 34 1 0  0 3
2015895403  8 35 2 2  1 3
2015895493  4 45 1 0  0 3
2015895493  5 50 2 0  0 3
2015895493  6 10 3 4  5 3
2015895493  7 80 1 0  0 3
2015895493  8 80 2 0  0 3
2015903873  3 38 1 2  1 3
2015903873  4  5 3 6  7 3
2015903873  5  2 3 6  7 3
2015903873  6 28 1 2  1 3
2015903873  7 29 2 0  0 3
2015903873  8  2 3 6  7 3
2015905473  3 43 1 0  1 3
2015905473  4 12 3 6  7 3
2015905473  5 11 3 6  7 3
2015905473  6 37 1 0  1 3
2015905473  7 35 2 0  0 3
2015905473  8 16 3 6  7 3
2015905473  9 11 3 6  7 3
2015908353  4 42 1 2  1 3
2015908353  5  9 3 6  7 3
2015908353  6 34 1 2  1 3
2015908353  7 35 2 0  0 3
2015908353  8 38 1 0  0 3
2015908353  9 41 2 2  1 3
2015909553  3 30 1 1  2 3
2015909553  4 26 2 0  0 3
2015909553  5  7 3 4  3 3
2015909553  6  0 3 0  0 3
2015909553  7 25 1 1  2 3
2015909553  8 23 2 0  0 3
2015909553  9  4 3 8  7 3
2015912033  6 43 1 2  1 3
2015912033  7 44 2 0  0 3
2015912033  8 12 3 6  7 3
2015912033  9  8 3 6  7 3
2015912033 10  2 3 6  7 3
2015912033 11 43 1 2  1 3
2015912033 12 14 3 6  7 3
2015912033 13 11 3 6  7 3
2015912273  4 38 1 7  8 3
2015912273  5  5 3 2  0 3
2015912273  6  0 3 2  0 3
2015912273  7 71 1 0  0 3
2015912273  8 79 2 0  0 3
2015915403  5 21 1 2  1 3
2015915403  6  4 3 5  0 3
2015915403  7 68 1 0  0 3
2015915403  8 80 2 0  0 3
2015917863  5 44 1 2  1 3
2015917863  6 27 2 0  0 3
end
label values age age_lbl
label def age_lbl 0 "Under 1 year", modify
label def age_lbl 1 "1", modify
label def age_lbl 2 "2", modify
label def age_lbl 3 "3", modify
label def age_lbl 4 "4", modify
label def age_lbl 5 "5", modify
label def age_lbl 6 "6", modify
label def age_lbl 7 "7", modify
label def age_lbl 8 "8", modify
label def age_lbl 9 "9", modify
label def age_lbl 10 "10", modify
label def age_lbl 11 "11", modify
label def age_lbl 12 "12", modify
label def age_lbl 13 "13", modify
label def age_lbl 14 "14", modify
label def age_lbl 16 "16", modify
label def age_lbl 17 "17", modify
label def age_lbl 20 "20", modify
label def age_lbl 21 "21", modify
label def age_lbl 23 "23", modify
label def age_lbl 25 "25", modify
label def age_lbl 26 "26", modify
label def age_lbl 27 "27", modify
label def age_lbl 28 "28", modify
label def age_lbl 29 "29", modify
label def age_lbl 30 "30", modify
label def age_lbl 34 "34", modify
label def age_lbl 35 "35", modify
label def age_lbl 37 "37", modify
label def age_lbl 38 "38", modify
label def age_lbl 39 "39", modify
label def age_lbl 40 "40", modify
label def age_lbl 41 "41", modify
label def age_lbl 42 "42", modify
label def age_lbl 43 "43", modify
label def age_lbl 44 "44", modify
label def age_lbl 45 "45", modify
label def age_lbl 50 "50", modify
label def age_lbl 53 "53", modify
label def age_lbl 54 "54", modify
label def age_lbl 55 "55", modify
label def age_lbl 61 "61", modify
label def age_lbl 68 "68", modify
label def age_lbl 71 "71", modify
label def age_lbl 79 "79", modify
label def age_lbl 80 "80", modify
label values famrel famrel_lbl
label def famrel_lbl 1 "Reference person", modify
label def famrel_lbl 2 "Spouse", modify
label def famrel_lbl 3 "Child", modify
label values ftype ftype_lbl
label def ftype_lbl 3 "Related subfamily", modify

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#4

14 Oct 2016, 13:04

Well, I don't think this is a data management issue so much as a data interpretation issue. After looking at the data carefully, it isn't at all obvious which people belong to the same subfamily. But I can think of a few possibilities.

The simplest one is to take the -pernum- variable serioiusly and to assume that each person belongs to a subfamily headed by the nearest preceding person who is a "Reference person." That is, we are assuming that the assignment of -pernum- reflects the subfamily structure, presumably known to the curators of the data but not elsewhere divulged in the data set. If that's true, then it's actually very easy:

Code:

by serialfam (pernum), sort: gen int subfamily = sum(famrel == "Reference person":famrel_lbl)

But your original post suggests that the pointers to mother and father play a role in defining these families. In that case, I am at something of a loss. FIrst of all, there are people whose pointers point to observations that do not exist, such as 0, 1, or 2 when the pernum's for that family begin at 3. Also, have a look at serialfam 2014951203. Everyone listed as a "child" in that family has pernum's 6 and 7 as mom and pop. Then pernum's 6 and 11 are both designated as "Reference person"--and they appear to have the same parents. So maybe they are siblings (twins? they are the same age). So is this two subfamilies, one headed by 6 and one headed by 11, but the subfamily headed by 11 includes children of 6? Or is this just one subfamily consisting of a couple (6 and 7) along with 6's sibling, and a bunch of 6 and 7's children? The code above treats them as two subfamilies, but the family-tree that we can construct would suggest otherwise. I suppose it depends on what is, for your purposes, a subfamily.
1 like
Comment
Kristin Schumacher

Join Date: Oct 2016

Posts: 6
#5

14 Oct 2016, 14:29

Thanks, Clyde! That command almost gets me there, but you are right, it only works if each person belongs to a subfamily headed by the nearest preceding reference person. In some households that is the case, but not in all. The example you pointed to demonstrates just that. I interpret this particular case as two separate subfamilies, where subfamily can be one individual. Here, the individuals are not listed according to the subfamily structure. We can see that from the momloc and poploc pointer variables. So, the new subfamily variable incorrectly classifies the two kids with pernum 12 and 13.

As for the pointers for the observations that don't exist, the sample data I generated with the dataex command was limited to subfamilies with children. There are 50K observations in the dataset and I didn't know how many I would have to include to get a useful subset. So, the missing pernums that momloc and poploc point to exist in the real data, just not the subset. (Zero means that the parent isn't present in the household.)
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30095

14 Oct 2016, 15:15

Well, the following solution does not work completely, but I think it's a major step forward and perhaps you can figure out how to take it from here. The idea is to first identify each Reference person as the start of a subfamily (which may end up containing nobody else). Then all Child observations get associated with the subfamily of their mother or father. Then Spouse observations get associated with the immediately preceding Reference person.

There is a technical issue in implementing this: the way to grab, for each child, the subfamily number associated with its mother or father is to use subscripting. But for that to work we can't use the pernum variable, because it does not start at 1 within each family (or at least, as you say, not in your example) So I create a new variable, seq, which goes in the same order as pernum but begins at 1 within each serialfam in the data, and user seq for the subscripting.

Code:

set more off
* Example generated by -dataex-. To install: ssc install dataex
clear
input long serialfam byte(pernum age famrel momloc poploc ftype)
2014948703  4 27 1 8  9 3
2014948703  5  2 3 4  0 3
2014948703  6 23 1 8  9 3
2014948703  7  4 3 4  0 3
2014948703  8 55 1 0  0 3
2014948703  9 61 2 0  0 3
2014951203  6 42 1 2  1 3
2014951203  7 43 2 0  0 3
2014951203  8 12 3 6  7 3
2014951203  9  6 3 6  7 3
2014951203 10  1 3 6  7 3
2014951203 11 42 1 2  1 3
2014951203 12 14 3 6  7 3
2014951203 13 10 3 6  7 3
2014951473  3 26 1 1  2 3
2014951473  4 27 2 0  0 3
2014951473  5  5 3 3  4 3
2014951473  6  0 3 3  4 3
2014951473  7 21 1 1  2 3
2014951473  8  1 3 3  4 3
2014956133  3 39 1 0  0 3
2014956133  4 13 3 6  0 3
2014956133  5  6 3 6  0 3
2014956133  6 39 1 1 10 3
2014956133  8 35 1 1 10 3
2014958463  3 20 1 5  6 3
2014958463  4  3 3 5  6 3
2014958463  5 41 1 1  7 3
2014958463  6 53 2 0  0 3
2014964143  2 42 1 1  0 3
2014964143  3  8 3 4  0 3
2014964143  4 38 1 1  0 3
2014964143  5 13 3 2  0 3
2014964143  6 12 3 2  0 3
2014964143  7 40 1 1  0 3
2014964143  8 16 3 0  0 3
2014964143  9 12 3 7  0 3
2015889343  3 42 1 2  1 3
2015889343  4 17 3 8  9 3
2015889343  5 16 3 8  9 3
2015889343  6 13 3 8  9 3
2015889343  7 10 3 8  9 3
2015889343  8 50 1 0  0 3
2015889343  9 54 2 2  1 3
2015895403  3 30 1 2  1 3
2015895403  4 29 2 0  0 3
2015895403  5  2 3 8  7 3
2015895403  6  1 3 4  3 3
2015895403  7 34 1 0  0 3
2015895403  8 35 2 2  1 3
2015895493  4 45 1 0  0 3
2015895493  5 50 2 0  0 3
2015895493  6 10 3 4  5 3
2015895493  7 80 1 0  0 3
2015895493  8 80 2 0  0 3
2015903873  3 38 1 2  1 3
2015903873  4  5 3 6  7 3
2015903873  5  2 3 6  7 3
2015903873  6 28 1 2  1 3
2015903873  7 29 2 0  0 3
2015903873  8  2 3 6  7 3
2015905473  3 43 1 0  1 3
2015905473  4 12 3 6  7 3
2015905473  5 11 3 6  7 3
2015905473  6 37 1 0  1 3
2015905473  7 35 2 0  0 3
2015905473  8 16 3 6  7 3
2015905473  9 11 3 6  7 3
2015908353  4 42 1 2  1 3
2015908353  5  9 3 6  7 3
2015908353  6 34 1 2  1 3
2015908353  7 35 2 0  0 3
2015908353  8 38 1 0  0 3
2015908353  9 41 2 2  1 3
2015909553  3 30 1 1  2 3
2015909553  4 26 2 0  0 3
2015909553  5  7 3 4  3 3
2015909553  6  0 3 0  0 3
2015909553  7 25 1 1  2 3
2015909553  8 23 2 0  0 3
2015909553  9  4 3 8  7 3
2015912033  6 43 1 2  1 3
2015912033  7 44 2 0  0 3
2015912033  8 12 3 6  7 3
2015912033  9  8 3 6  7 3
2015912033 10  2 3 6  7 3
2015912033 11 43 1 2  1 3
2015912033 12 14 3 6  7 3
2015912033 13 11 3 6  7 3
2015912273  4 38 1 7  8 3
2015912273  5  5 3 2  0 3
2015912273  6  0 3 2  0 3
2015912273  7 71 1 0  0 3
2015912273  8 79 2 0  0 3
2015915403  5 21 1 2  1 3
2015915403  6  4 3 5  0 3
2015915403  7 68 1 0  0 3
2015915403  8 80 2 0  0 3
2015917863  5 44 1 2  1 3
2015917863  6 27 2 0  0 3
end
label values age age_lbl
label def age_lbl 0 "Under 1 year", modify
label def age_lbl 1 "1", modify
label def age_lbl 2 "2", modify
label def age_lbl 3 "3", modify
label def age_lbl 4 "4", modify
label def age_lbl 5 "5", modify
label def age_lbl 6 "6", modify
label def age_lbl 7 "7", modify
label def age_lbl 8 "8", modify
label def age_lbl 9 "9", modify
label def age_lbl 10 "10", modify
label def age_lbl 11 "11", modify
label def age_lbl 12 "12", modify
label def age_lbl 13 "13", modify
label def age_lbl 14 "14", modify
label def age_lbl 16 "16", modify
label def age_lbl 17 "17", modify
label def age_lbl 20 "20", modify
label def age_lbl 21 "21", modify
label def age_lbl 23 "23", modify
label def age_lbl 25 "25", modify
label def age_lbl 26 "26", modify
label def age_lbl 27 "27", modify
label def age_lbl 28 "28", modify
label def age_lbl 29 "29", modify
label def age_lbl 30 "30", modify
label def age_lbl 34 "34", modify
label def age_lbl 35 "35", modify
label def age_lbl 37 "37", modify
label def age_lbl 38 "38", modify
label def age_lbl 39 "39", modify
label def age_lbl 40 "40", modify
label def age_lbl 41 "41", modify
label def age_lbl 42 "42", modify
label def age_lbl 43 "43", modify
label def age_lbl 44 "44", modify
label def age_lbl 45 "45", modify
label def age_lbl 50 "50", modify
label def age_lbl 53 "53", modify
label def age_lbl 54 "54", modify
label def age_lbl 55 "55", modify
label def age_lbl 61 "61", modify
label def age_lbl 68 "68", modify
label def age_lbl 71 "71", modify
label def age_lbl 79 "79", modify
label def age_lbl 80 "80", modify
label values famrel famrel_lbl
label def famrel_lbl 1 "Reference person", modify
label def famrel_lbl 2 "Spouse", modify
label def famrel_lbl 3 "Child", modify
label values ftype ftype_lbl
label def ftype_lbl 3 "Related subfamily", modify

//    CALCULATE A SEQUENTIAL PERSON NUMBER THAT STARTS FROM 1
//    IN EACH SERIAL FAMILY
by serialfam (pernum), sort: gen int seq = _n
order seq, after(pernum)
//    CREATE NEW PARENTAL POINTERS BASED ON SEQ
by serialfam (pernum): gen int momptr = momloc - pernum[1] + 1
by serialfam (pernum): gen int popptr = poploc - pernum[1] + 1
order *ptr, after(poploc)

//    SEQUENTIALLY NUMBER THE DIFFERENT REFERENCE PERSONS IN EACH
//    SERIAL FAMILY
by serialfam (seq), sort: gen subfamily_num = 1 if famrel == ///
    "Reference person":famrel_lbl
by serialfam (seq): replace subfamily_num = sum(subfamily_num) if famrel == ///
    "Reference person":famrel_lbl
order subfamily_num, after(seq)

//    NOW ASSOCIATE EACH CHILD WITH THE subfamily_num OF ITS PARENT
by serialfam (seq): replace subfamily_num = subfamily_num[momptr] ///
    if missing(subfamily_num) & famrel == "Child":famrel_lbl & momptr >= 1
by serialfam (seq): replace subfamily_num = subfamily_num[popptr] ///
    if missing(subfamily_num) & famrel == "Child":famrel_lbl & popptr >= 1
    
//    ASSOCIATE EACH SPOUSE WITH THE IMMEDIATELY PRECEDING REFERENCE PERSON
replace subfamily_num = subfamily_num[_n-1] if famrel == "Spouse":famrel_lbl ///
    & famrel[_n-1] == "Reference person":famrel_lbl

by serialfam: egen problem = max(missing(subfamily_num))
browse if problem

Now, this leaves some people not attached to any subfamily. In the example there are three serialfam's that have this problem, and in each case it is a matter of children whose momloc and poploc do not appear in the example data. Some of these will go away in the full data because your example will include their parents. But for some of these, both momloc and poploc are 0, so no parent is identifiable even in the full data. I don't know how one would go about attaching these children to a subfamily.

Comment

Kristin Schumacher

Join Date: Oct 2016

Posts: 6
#7

18 Oct 2016, 14:42

Clyde, thanks so much for your help. This was exactly the help I was hoping to get and with just a few modifications works great with the full dataset. I really appreciate it!
Comment

Announcement