Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Interesting consolidation probleim

    Hi all -

    I have an interesting problem to solve. I don't even know what to call it so I've had trouble searching for like solutions. If you have a good search term feel free to let me know, but advice otherwise appreciated.

    I have grouped parent/child data, but it's not unique. That is child can have multiple parents and there can be multiple children. The parents can also be children, because they have other parents. Each has a characteristic that I'd like to a) count up, and then b) compare within groups. (this is patent data, if you are wondering)

    Here's a simple example (missing = no characteristic)
    ChildID Child Char ParentID Parent Char
    1 A 10 A
    2 10 A
    3 A 30 B
    30 B 40
    30 B 50 B
    So, from this, I would like some way to:
    a) count up unique members of the family (3) (4)
    b) count up unique families (1,2,10), (3,30,40,50)
    c) count the characteristics in each family (2 - A) (1 - A, 2 -B)
    d) compare characteristics within each family (all same) (2 different)

    Is there a script or other way to massage the data to get there? Any thoughts or pointers appreciated.

    Regards,
    MR

  • #2
    Hello Michael Risch. I've only been looking at your post for a couple minutes, so I'm not going to claim that I understand exactly what you want to do. But I will make one comment: I think it would be very helpful if you had a unique ID code for each person that is not connected to any of their roles. The data you showed does not include that type of ID code. Do you have one? Thanks for clarifying.

    The other thing that's not clear to me is whether you have one big dataset, or separate datasets for the different roles. I think that separate datasets with merging as required for particular analyses would be the better way to go. But this would require the unique PersonID variable I mentioned above.
    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 18.5 (Windows)

    Comment


    • #3
      My understanding is that your core problem is to identify families? That is to say, your data do not contain Family IDs. The counting tasks seem secondary and easily solved, once this is achieved.

      If this is correct, then you might find the user-contributed command group_twoway useful. Try
      Code:
      search group_twoway
      in Stata, and proceed from there.
      Last edited by Hemanshu Kumar; 22 Jul 2024, 21:53.

      Comment


      • #4
        Bruce Weaver the childid and parentid are unique ids for each record. I do pull the data together with a merge. Indeed, child char and parent char both come from the same file, which is a 1:1 id:char. And the bigger table is created with a 1:m merge - one table has all the children, while the other has any children pair with each parent (however many there are. In that sense, my table is incomplete, because 10 and 50 should also appear in childID, but without any parent. But 40 would not appear there because it has no char and thus wasn't in the original child. 2 does appear on the child side of the full table because it is a merge==2 unmatched, but I keep all parents that have a characteristic. Hope that helps (or points to a better way)

        Comment


        • #5
          You don't seem to have given consideration to Hemanshu Kumar's observation that you first need to create a family identifier. See how to do this in #5 of https://www.statalist.org/forums/for...neophyte-stata using group_id from SSC. Then, the totals can be easily computed.
          Last edited by Andrew Musau; 23 Jul 2024, 06:07.

          Comment


          • #6
            Andrew Musau I'm not sure if this problem is solved by a straightforward use of group_id. Note that identifying children and parents that belong to the same family is an iterative process (it is not simply a matter of putting siblings into the same family, but also grandchildren etc), unlike the sort of problems being solved by the afore-mentioned program.

            If I'm mistaken, could you elaborate?

            Comment


            • #7
              Originally posted by Hemanshu Kumar View Post
              Note that identifying children and parents that belong to the same family is an iterative process (it is not simply a matter of putting siblings into the same family, but also grandchildren etc)
              I believe that is what group_id does. If you have an example in mind, I will be happy to explore.

              Comment


              • #8
                Here is a solution, with the toy dataset posted by OP.

                First, the dataset:
                Code:
                clear
                input int child_id str2 child_char int parent_id str2 parent_char
                1    "A"    10    "A"
                2    ""    10    "A"
                3    "A"    30    "B"
                30    "B"    40    ""
                30    "B"    50    "B"
                end
                Now for the code:

                Code:
                * net install group_twoway.pkg // UNCOMMENT THIS LINE ONCE TO INSTALL THE COMMAND
                
                // IDENTIFY SEPARATE FAMILIES
                group_twoway parent_id child_id, gen(family_id)
                which gives us:
                Code:
                . list, noobs sepby(family_id) abbrev(12)
                
                  +-------------------------------------------------------------+
                  | child_id   child_char   parent_id   parent_char   family_id |
                  |-------------------------------------------------------------|
                  |        1            A          10             A           1 |
                  |        2                       10             A           1 |
                  |-------------------------------------------------------------|
                  |        3            A          30             B           2 |
                  |       30            B          40                         2 |
                  |       30            B          50             B           2 |
                  +-------------------------------------------------------------+
                Now for answering OP's individual questions,

                Code:
                // CREATE DATASET OF UNIQUE INDIVIDUALS, WITH THEIR CHARACTERISTICS AND FAMILIES
                
                tempfile original
                save `original'
                
                keep child_id child_char family_id
                duplicates drop
                rename child_* person_*
                tempfile children
                save `children'
                
                use `original'
                keep parent_id parent_char family_id
                duplicates drop
                rename parent_* person_*
                append using `children'
                duplicates drop
                
                // (A) COUNT MEMBERS OF EACH FAMILY
                
                bysort family_id (person_id): gen num_members = _N
                egen tag_family = tag(family_id)
                so we can get:

                Code:
                . list family_id num_members if tag_family, noobs abbrev(12)
                
                  +-------------------------+
                  | family_id   num_members |
                  |-------------------------|
                  |         1             3 |
                  |         2             4 |
                  +-------------------------+
                Code:
                // (B) COUNT UNIQUE FAMILIES
                
                sum family_id, meanonly
                local num_families = r(max)
                dis "The total number of separate families is `num_families'."
                which produces:
                Code:
                The total number of separate families is 2.
                Code:
                // (C) COUNT NUMBER OF MEMBERS WITH EACH CHARACTERISTIC IN EACH FAMILY
                
                egen char_group = group(family_id person_char)
                bysort family_id char_group: gen num_of_this_char = _N
                replace num_of_this_char = . if person_char == ""
                egen tag_char = tag(family_id char_group)
                so we can get:

                Code:
                . list family_id person_char num_of_this_char if tag_char, noobs abbrev(16)
                
                  +--------------------------------------------+
                  | family_id   person_char   num_of_this_char |
                  |--------------------------------------------|
                  |         1             A                  2 |
                  |         2             A                  1 |
                  |         2             B                  2 |
                  +--------------------------------------------+
                Code:
                // (D) DOES A FAMILY HAVE ALL CHARACTERISTICS THE SAME OR NOT?
                
                egen min_char_group = min(char_group), by(family_id)
                egen max_char_group = max(char_group), by(family_id)
                
                gen byte has_diff_chars = (min_char_group != max_char_group)
                so we can get
                Code:
                . list family_id has_diff_chars if tag_family, noobs abbrev(16)
                
                  +----------------------------+
                  | family_id   has_diff_chars |
                  |----------------------------|
                  |         1                0 |
                  |         2                1 |
                  +----------------------------+
                Last edited by Hemanshu Kumar; 23 Jul 2024, 12:27.

                Comment


                • #9
                  The only issue here as far as group_id is concerned is that not all values of parent_id are observed within child_id. That is an easy fix.

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input int child_id str2 child_char int parent_id str2 parent_char
                   1 "A" 10 "A"
                   2 ""  10 "A"
                   3 "A" 30 "B"
                  30 "B" 40 ""
                  30 "B" 50 "B"
                  end
                  
                  expand 2, gen(new)
                  gen id= cond(!new, child_id, parent_id)
                  clonevar family_id= child_id
                  *ssc install group_id
                  group_id family_id, matchby(id)
                  keep if !new
                  drop new
                  Res.:

                  Code:
                  . l, sepby(family_id)
                  
                       +-----------------------------------------------------------+
                       | child_id   child_~r   parent~d   parent~r   id   family~d |
                       |-----------------------------------------------------------|
                    1. |        1          A         10          A    1          1 |
                    2. |        2                    10          A    2          1 |
                       |-----------------------------------------------------------|
                    3. |        3          A         30          B    3          3 |
                    4. |       30          B         50          B   30          3 |
                    5. |       30          B         40              30          3 |
                       +-----------------------------------------------------------+
                  
                  .

                  Comment


                  • #10
                    Thanks to you both. I had, in fact, heeded the advice of needing a family id. I ran groupby and it worked...but maybe too well. I had one family with 401,000 members, which I think can be right. I plan to go back and check the data and also test group_id, but this definitely has me on the right track.

                    Comment


                    • #11
                      Problems can arise if you have missing values for the identifiers. Consider, e.g.,

                      Code:
                      clear
                      input int child_id str2 child_char int parent_id str2 parent_char
                       1 "A" 10 "A"
                       10 ""  .
                       2 ""  10 "A"
                       3 "A" 30 "B"
                       . ""  30 ""
                      30 "B" 40 ""
                      . "" 1 "C"
                      30 "B" 50 "B"
                      end
                      
                      expand 2, gen(new)
                      gen id= cond(!new, child_id, parent_id)
                      clonevar family_id= child_id
                      *ssc install group_id
                      group_id family_id, matchby(id)
                      keep if !new
                      drop new
                      Res.:

                      Code:
                      . sort fam
                      
                      . l, sepby(family_id)
                      
                           +-----------------------------------------------------------+
                           | child_id   child_~r   parent~d   parent~r   id   family~d |
                           |-----------------------------------------------------------|
                        1. |        1          A         10          A    1          1 |
                        2. |       10                     .              10          1 |
                        3. |        2                    10          A    2          1 |
                           |-----------------------------------------------------------|
                        4. |       30          B         40              30          3 |
                        5. |       30          B         50          B   30          3 |
                        6. |        3          A         30          B    3          3 |
                           |-----------------------------------------------------------|
                        7. |        .                    30               .          . |
                        8. |        .                     1          C    .          . |
                           +-----------------------------------------------------------+
                      Missing values for the key identifier will be grouped together, leading to erroneous results as individuals with ID=1 and ID=30 should be grouped in one of the other families. I usually solve this by assigning a distinct value to each missing value as below. Assigning negative the observation number, as below, usually does the job as IDs are usually positive integers.

                      Code:
                      clear
                      input int child_id str2 child_char int parent_id str2 parent_char
                       1 "A" 10 "A"
                       10 ""  .
                       2 ""  10 "A"
                       3 "A" 30 "B"
                       . ""  30 ""
                      30 "B" 40 ""
                      . "" 1 "C"
                      30 "B" 50 "B"
                      end
                      
                      expand 2, gen(new)
                      gen long id= cond(!new, child_id, parent_id)
                      replace id=-_n if missing(id)
                      gen long family_id= cond(missing(child_id), parent_id, child_id)
                      *ssc install group_id
                      group_id family_id, matchby(id)
                      keep if !new
                      drop new
                      Res.:

                      Code:
                      . sort fam
                      
                      . l, sepby(family_id)
                      
                           +-----------------------------------------------------------+
                           | child_id   child_~r   parent~d   parent~r   id   family~d |
                           |-----------------------------------------------------------|
                        1. |        .                     1          C   -7          1 |
                        2. |        1          A         10          A    1          1 |
                        3. |        2                    10          A    2          1 |
                        4. |       10                     .              10          1 |
                           |-----------------------------------------------------------|
                        5. |        .                    30              -5          3 |
                        6. |       30          B         40              30          3 |
                        7. |       30          B         50          B   30          3 |
                        8. |        3          A         30          B    3          3 |
                           +-----------------------------------------------------------+

                      Comment


                      • #12
                        Thanks, very helpful. I have the data put together properly finally. I will fill in the missing and give both of these commands a try!

                        Comment


                        • #13
                          Thanks again - I got the groups working. I need to do a little bit more work to get the counting right (and I have a complication in that parent actually has two characteristics, so I can't just delete duplicates), but I'm well on my way here.

                          Comment

                          Working...
                          X