Interesting consolidation probleim

Michael Risch

Join Date: Jul 2014

Posts: 23
#1

Interesting consolidation probleim

22 Jul 2024, 18:50

Hi all -

I have an interesting problem to solve. I don't even know what to call it so I've had trouble searching for like solutions. If you have a good search term feel free to let me know, but advice otherwise appreciated.

I have grouped parent/child data, but it's not unique. That is child can have multiple parents and there can be multiple children. The parents can also be children, because they have other parents. Each has a characteristic that I'd like to a) count up, and then b) compare within groups. (this is patent data, if you are wondering)

Here's a simple example (missing = no characteristic)

ChildID Child Char ParentID Parent Char

1 A 10 A

2 10 A

3 A 30 B

30 B 40

30 B 50 B

So, from this, I would like some way to:
a) count up unique members of the family (3) (4)
b) count up unique families (1,2,10), (3,30,40,50)
c) count the characteristics in each family (2 - A) (1 - A, 2 -B)
d) compare characteristics within each family (all same) (2 different)

Is there a script or other way to massage the data to get there? Any thoughts or pointers appreciated.

Regards,
MR
Tags: None
Bruce Weaver

Join Date: May 2014

Posts: 1133
#2

22 Jul 2024, 19:09

Hello Michael Risch. I've only been looking at your post for a couple minutes, so I'm not going to claim that I understand exactly what you want to do. But I will make one comment: I think it would be very helpful if you had a unique ID code for each person that is not connected to any of their roles. The data you showed does not include that type of ID code. Do you have one? Thanks for clarifying.

The other thing that's not clear to me is whether you have one big dataset, or separate datasets for the different roles. I think that separate datasets with merging as required for particular analyses would be the better way to go. But this would require the unique PersonID variable I mentioned above.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1411
#3

22 Jul 2024, 21:24

My understanding is that your core problem is to identify families? That is to say, your data do not contain Family IDs. The counting tasks seem secondary and easily solved, once this is achieved.

If this is correct, then you might find the user-contributed command group_twoway useful. Try

Code:

search group_twoway

in Stata, and proceed from there.

Last edited by Hemanshu Kumar; 22 Jul 2024, 21:53.
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#4

23 Jul 2024, 05:26

Bruce Weaver the childid and parentid are unique ids for each record. I do pull the data together with a merge. Indeed, child char and parent char both come from the same file, which is a 1:1 id:char. And the bigger table is created with a 1:m merge - one table has all the children, while the other has any children pair with each parent (however many there are. In that sense, my table is incomplete, because 10 and 50 should also appear in childID, but without any parent. But 40 would not appear there because it has no char and thus wasn't in the original child. 2 does appear on the child side of the full table because it is a merge==2 unmatched, but I keep all parents that have a characteristic. Hope that helps (or points to a better way)
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#5

23 Jul 2024, 05:58

You don't seem to have given consideration to Hemanshu Kumar's observation that you first need to create a family identifier. See how to do this in #5 of https://www.statalist.org/forums/for...neophyte-stata using group_id from SSC. Then, the totals can be easily computed.

Last edited by Andrew Musau; 23 Jul 2024, 06:07.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1411
#6

23 Jul 2024, 07:41

Andrew Musau I'm not sure if this problem is solved by a straightforward use of group_id. Note that identifying children and parents that belong to the same family is an iterative process (it is not simply a matter of putting siblings into the same family, but also grandchildren etc), unlike the sort of problems being solved by the afore-mentioned program.

If I'm mistaken, could you elaborate?
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#7

23 Jul 2024, 08:17

Originally posted by Hemanshu Kumar View Post

Note that identifying children and parents that belong to the same family is an iterative process (it is not simply a matter of putting siblings into the same family, but also grandchildren etc)

I believe that is what group_id does. If you have an example in mind, I will be happy to explore.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1411

23 Jul 2024, 12:25

Here is a solution, with the toy dataset posted by OP.

First, the dataset:

Code:

clear
input int child_id str2 child_char int parent_id str2 parent_char
1    "A"    10    "A"
2    ""    10    "A"
3    "A"    30    "B"
30    "B"    40    ""
30    "B"    50    "B"
end

Now for the code:

Code:

* net install group_twoway.pkg // UNCOMMENT THIS LINE ONCE TO INSTALL THE COMMAND

// IDENTIFY SEPARATE FAMILIES
group_twoway parent_id child_id, gen(family_id)

which gives us:

Code:

. list, noobs sepby(family_id) abbrev(12)

  +-------------------------------------------------------------+
  | child_id   child_char   parent_id   parent_char   family_id |
  |-------------------------------------------------------------|
  |        1            A          10             A           1 |
  |        2                       10             A           1 |
  |-------------------------------------------------------------|
  |        3            A          30             B           2 |
  |       30            B          40                         2 |
  |       30            B          50             B           2 |
  +-------------------------------------------------------------+

Now for answering OP's individual questions,

Code:

// CREATE DATASET OF UNIQUE INDIVIDUALS, WITH THEIR CHARACTERISTICS AND FAMILIES

tempfile original
save `original'

keep child_id child_char family_id
duplicates drop
rename child_* person_*
tempfile children
save `children'

use `original'
keep parent_id parent_char family_id
duplicates drop
rename parent_* person_*
append using `children'
duplicates drop

// (A) COUNT MEMBERS OF EACH FAMILY

bysort family_id (person_id): gen num_members = _N
egen tag_family = tag(family_id)

so we can get:

Code:

. list family_id num_members if tag_family, noobs abbrev(12)

  +-------------------------+
  | family_id   num_members |
  |-------------------------|
  |         1             3 |
  |         2             4 |
  +-------------------------+

Code:

// (B) COUNT UNIQUE FAMILIES

sum family_id, meanonly
local num_families = r(max)
dis "The total number of separate families is `num_families'."

which produces:

Code:

The total number of separate families is 2.

Code:

// (C) COUNT NUMBER OF MEMBERS WITH EACH CHARACTERISTIC IN EACH FAMILY

egen char_group = group(family_id person_char)
bysort family_id char_group: gen num_of_this_char = _N
replace num_of_this_char = . if person_char == ""
egen tag_char = tag(family_id char_group)

so we can get:

Code:

. list family_id person_char num_of_this_char if tag_char, noobs abbrev(16)

  +--------------------------------------------+
  | family_id   person_char   num_of_this_char |
  |--------------------------------------------|
  |         1             A                  2 |
  |         2             A                  1 |
  |         2             B                  2 |
  +--------------------------------------------+

Code:

// (D) DOES A FAMILY HAVE ALL CHARACTERISTICS THE SAME OR NOT?

egen min_char_group = min(char_group), by(family_id)
egen max_char_group = max(char_group), by(family_id)

gen byte has_diff_chars = (min_char_group != max_char_group)

so we can get

Code:

. list family_id has_diff_chars if tag_family, noobs abbrev(16)

  +----------------------------+
  | family_id   has_diff_chars |
  |----------------------------|
  |         1                0 |
  |         2                1 |
  +----------------------------+

Last edited by Hemanshu Kumar; 23 Jul 2024, 12:27.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10214

23 Jul 2024, 13:10

The only issue here as far as group_id is concerned is that not all values of parent_id are observed within child_id. That is an easy fix.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int child_id str2 child_char int parent_id str2 parent_char
 1 "A" 10 "A"
 2 ""  10 "A"
 3 "A" 30 "B"
30 "B" 40 ""
30 "B" 50 "B"
end

expand 2, gen(new)
gen id= cond(!new, child_id, parent_id)
clonevar family_id= child_id
*ssc install group_id
group_id family_id, matchby(id)
keep if !new
drop new

Res.:

Code:

. l, sepby(family_id)

     +-----------------------------------------------------------+
     | child_id   child_~r   parent~d   parent~r   id   family~d |
     |-----------------------------------------------------------|
  1. |        1          A         10          A    1          1 |
  2. |        2                    10          A    2          1 |
     |-----------------------------------------------------------|
  3. |        3          A         30          B    3          3 |
  4. |       30          B         50          B   30          3 |
  5. |       30          B         40              30          3 |
     +-----------------------------------------------------------+

.

Comment

Michael Risch

Join Date: Jul 2014

Posts: 23
#10

23 Jul 2024, 14:56

Thanks to you both. I had, in fact, heeded the advice of needing a family id. I ran groupby and it worked...but maybe too well. I had one family with 401,000 members, which I think can be right. I plan to go back and check the data and also test group_id, but this definitely has me on the right track.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10214

#11

23 Jul 2024, 15:38

Problems can arise if you have missing values for the identifiers. Consider, e.g.,

Code:

clear
input int child_id str2 child_char int parent_id str2 parent_char
 1 "A" 10 "A"
 10 ""  .
 2 ""  10 "A"
 3 "A" 30 "B"
 . ""  30 ""
30 "B" 40 ""
. "" 1 "C"
30 "B" 50 "B"
end

expand 2, gen(new)
gen id= cond(!new, child_id, parent_id)
clonevar family_id= child_id
*ssc install group_id
group_id family_id, matchby(id)
keep if !new
drop new

Res.:

Code:

. sort fam

. l, sepby(family_id)

     +-----------------------------------------------------------+
     | child_id   child_~r   parent~d   parent~r   id   family~d |
     |-----------------------------------------------------------|
  1. |        1          A         10          A    1          1 |
  2. |       10                     .              10          1 |
  3. |        2                    10          A    2          1 |
     |-----------------------------------------------------------|
  4. |       30          B         40              30          3 |
  5. |       30          B         50          B   30          3 |
  6. |        3          A         30          B    3          3 |
     |-----------------------------------------------------------|
  7. |        .                    30               .          . |
  8. |        .                     1          C    .          . |
     +-----------------------------------------------------------+

Missing values for the key identifier will be grouped together, leading to erroneous results as individuals with ID=1 and ID=30 should be grouped in one of the other families. I usually solve this by assigning a distinct value to each missing value as below. Assigning negative the observation number, as below, usually does the job as IDs are usually positive integers.

Code:

clear
input int child_id str2 child_char int parent_id str2 parent_char
 1 "A" 10 "A"
 10 ""  .
 2 ""  10 "A"
 3 "A" 30 "B"
 . ""  30 ""
30 "B" 40 ""
. "" 1 "C"
30 "B" 50 "B"
end

expand 2, gen(new)
gen long id= cond(!new, child_id, parent_id)
replace id=-_n if missing(id)
gen long family_id= cond(missing(child_id), parent_id, child_id)
*ssc install group_id
group_id family_id, matchby(id)
keep if !new
drop new

Res.:

Code:

. sort fam

. l, sepby(family_id)

     +-----------------------------------------------------------+
     | child_id   child_~r   parent~d   parent~r   id   family~d |
     |-----------------------------------------------------------|
  1. |        .                     1          C   -7          1 |
  2. |        1          A         10          A    1          1 |
  3. |        2                    10          A    2          1 |
  4. |       10                     .              10          1 |
     |-----------------------------------------------------------|
  5. |        .                    30              -5          3 |
  6. |       30          B         40              30          3 |
  7. |       30          B         50          B   30          3 |
  8. |        3          A         30          B    3          3 |
     +-----------------------------------------------------------+

Comment

Michael Risch

Join Date: Jul 2014

Posts: 23
#12

23 Jul 2024, 17:44

Thanks, very helpful. I have the data put together properly finally. I will fill in the missing and give both of these commands a try!
Comment
Michael Risch

Join Date: Jul 2014

Posts: 23
#13

23 Jul 2024, 20:06

Thanks again - I got the groups working. I need to do a little bit more work to get the counting right (and I have a complication in that parent actually has two characteristics, so I can't just delete duplicates), but I'm well on my way here.
Comment

ChildID	Child Char	ParentID	Parent Char
1	A	10	A
2		10	A
3	A	30	B
30	B	40
30	B	50	B

Announcement