Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do you merge the unique ids of one dataset into another dataset?

    Hey good people,

    I have two datasets. A has 2000 data points, while B has 10000 data points.

    A dataset is uniquely identified with three vars (region, commune, and village), and B dataset is uniquely identified with four vars (region, commune, village, and polling station names). B cannot be uniquely identified with the three vars that A has.

    Now, I'd like to create a concordance list: region/commune/village in A, and region/commune/village in B.

    I am thinking about creating unique ids for A dataset first, and then apply (merge) those ids to B.
    For example,

    In A:
    Region A, commune A, village A, unique id: 1
    Region A, commune A, village B, unique id: 2

    In B:

    Region A, commune A, village A, polling station A, matching id: 1
    Region A, commune A, village A, polling station B, matching id: 1
    Region A, commune A, village B, polling station A, matching id: 2
    Region A, commune A, village B, polling station B, matching id: 2

    Do you guys know how to create a concordance list like that and how do you use A's unique ids to code B?


    Anyone?

    Thanks!

    Last edited by Xiao Roland Tan; 09 Sep 2017, 11:18.

  • #2
    your subject line says merge but that doesn't appear, in any obvious sense, in your text; however, I do think that a 1:m or m:1 (depending on which dataset is the master) is what you want; read
    Code:
    help merge
    my guess is that you want something like:
    Code:
    use B
    sort region commune village poll_station
    save B2
    us A
    sort region commune village
    merge 1:m region commune village using B2
    you should, of course, use your real variable names

    Comment


    • #3
      Originally posted by Rich Goldstein View Post
      your subject line says merge but that doesn't appear, in any obvious sense, in your text; however, I do think that a 1:m or m:1 (depending on which dataset is the master) is what you want; read
      Code:
      help merge
      my guess is that you want something like:
      Code:
      use B
      sort region commune village poll_station
      save B2
      us A
      sort region commune village
      merge 1:m region commune village using B2
      you should, of course, use your real variable names
      Dear Rich,

      Thank you for your suggestion. I also agree 1:m is a good fit, but I don't really need the final merge product.

      I want the two files sharing the same ids in two separate files. Basically, I want to translate file A's unique id into code file B even though A is uniquely defined by 3 vars while B is uniquely defined by 4 vars. The ids in file B will not uniquely identify B, so we can call them "matching ids." I only need to create a concordance list between A and B. Do you think it is possible?

      Thanks!
      Last edited by Xiao Roland Tan; 09 Sep 2017, 12:37.

      Comment


      • #4
        If some combinations of region commune village occur in one dataset but not the other, you need to combine data from both datasets, reduce to unique combinations of region commune village, create the identifiers and then merge these back with the original datasets. Here's the general approach using made-up data:

        Code:
        clear
        set seed 132
        set obs 5
        gen region = _n
        expand 5
        bysort region: gen commune = _n
        expand 3
        bysort region commune: gen village = _n
        drop if runiform() < .1
        save "dataset_A.dta", replace
        
        clear
        set obs 5
        gen region = _n
        expand 5
        bysort region: gen commune = _n
        expand 3
        bysort region commune: gen village = _n
        drop if runiform() < .2
        expand 7
        bysort region commune village: gen poll_id = _n
        save "dataset_B.dta", replace
        
        * reduce dataset A and B to one observation per group and append
        use region commune village using "dataset_A.dta", clear
        bysort region commune village: keep if _n == 1
        save "dataset_A1.dta", replace
        use region commune village using "dataset_B.dta", clear
        bysort region commune village: keep if _n == 1
        
        * combine observations that identify groups from both A and B
        append using "dataset_A1.dta"
        
        * remove duplicates and assign a final identifier
        bysort region commune village: keep if _n == 1
        gen long group_id = _n
        save "region_commune_village_id.dta", replace
        
        * return to original data and merge with new group identifiers
        use "dataset_A.dta", clear
        merge 1:1 region commune village using "region_commune_village_id.dta", ///
            assert(match using) keep(match) nogen
            
        use "dataset_B.dta", clear
        merge m:1 region commune village using "region_commune_village_id.dta", ///
            assert(match using) keep(match) nogen

        Comment


        • #5
          Originally posted by Robert Picard View Post
          If some combinations of region commune village occur in one dataset but not the other, you need to combine data from both datasets, reduce to unique combinations of region commune village, create the identifiers and then merge these back with the original datasets. Here's the general approach using made-up data:

          Code:
          clear
          set seed 132
          set obs 5
          gen region = _n
          expand 5
          bysort region: gen commune = _n
          expand 3
          bysort region commune: gen village = _n
          drop if runiform() < .1
          save "dataset_A.dta", replace
          
          clear
          set obs 5
          gen region = _n
          expand 5
          bysort region: gen commune = _n
          expand 3
          bysort region commune: gen village = _n
          drop if runiform() < .2
          expand 7
          bysort region commune village: gen poll_id = _n
          save "dataset_B.dta", replace
          
          * reduce dataset A and B to one observation per group and append
          use region commune village using "dataset_A.dta", clear
          bysort region commune village: keep if _n == 1
          save "dataset_A1.dta", replace
          use region commune village using "dataset_B.dta", clear
          bysort region commune village: keep if _n == 1
          
          * combine observations that identify groups from both A and B
          append using "dataset_A1.dta"
          
          * remove duplicates and assign a final identifier
          bysort region commune village: keep if _n == 1
          gen long group_id = _n
          save "region_commune_village_id.dta", replace
          
          * return to original data and merge with new group identifiers
          use "dataset_A.dta", clear
          merge 1:1 region commune village using "region_commune_village_id.dta", ///
          assert(match using) keep(match) nogen
          
          use "dataset_B.dta", clear
          merge m:1 region commune village using "region_commune_village_id.dta", ///
          assert(match using) keep(match) nogen
          Thank you Robert and I will definitely save your code for future reference! But, in the mean time, I am afraid that I should not reduce the data points in dataset B to unique combinations of 3 vars. Otherwise, I would lose 70% of the data points in B as B is uniquely identified by 4 vars. I need that 70% of data points for analysis purposes.

          I wondered whether I could do something like this? First, uniquely identify A by 3 vars. Then, create a column in B, and name it "matching ids." Then, match this column to the unique ids in A. If A and B share the same region/commune/village, then B will have the same id as A. Note this id is not unique in B as B, in most cases, has multiple data points sharing the same 3 vars.

          Then, I will have a list of a concordance list: region/commune/village in A, and region/commune/village in B.

          In A:
          Region A, commune A, village A, unique id: 1
          Region A, commune A, village B, unique id: 2

          In B:

          Region A, commune A, village A, polling station A, matching id: 1
          Region A, commune A, village A, polling station B, matching id: 1
          Region A, commune A, village A, polling station C, matching id: 1
          ...
          Region A, commune A, village B, polling station A, matching id: 2
          Region A, commune A, village B, polling station B, matching id: 2
          ...

          Is this possible? Thanks!
          Last edited by Xiao Roland Tan; 09 Sep 2017, 13:20.

          Comment


          • #6
            You are not loosing anything in either dataset, the example creates a concordance table that you then merge back with each original dataset.

            Comment


            • #7
              Originally posted by Robert Picard View Post
              You are not loosing anything in either dataset, the example creates a concordance table that you then merge back with each original dataset.
              Oh okay Robert. Thank you very much!!!

              Comment


              • #8
                I fail to understand what purpose is served by assigning an ID variable to region/commune/village combinations and producing the concordance. What advantage does your concordance list in post #5 provide that isn't provided by the following summary? What problem is the concordance list supposed to solve?
                Code:
                observations in B                                matching observations in A
                region A commune A village A polling station A   region A commune A village A
                region A commune A village A polling station B   region A commune A village A
                region A commune A village A polling station C   region A commune A village A
                ...
                region A commune A village B polling station A   region A commune A village B
                region A commune A village B polling station B   region A commune A village B
                ...
                With that said, what you describe in post #5 is possible. You can generate the group ID in dataset A as Robert suggested
                Code:
                use A
                sort region commune village
                generate ID = _n
                save A2
                and then
                Code:
                use B
                merge m:1 region commune village using A2, keepusing(ID) keep(master match)
                list region commune village pollingstation if _merge != 3 // find region/commune/village in B that were not in A
                drop _merge
                save B2

                Comment


                • #9
                  Originally posted by Robert Picard View Post
                  If some combinations of region commune village occur in one dataset but not the other, you need to combine data from both datasets, reduce to unique combinations of region commune village, create the identifiers and then merge these back with the original datasets. Here's the general approach using made-up data:

                  Code:
                  clear
                  set seed 132
                  set obs 5
                  gen region = _n
                  expand 5
                  bysort region: gen commune = _n
                  expand 3
                  bysort region commune: gen village = _n
                  drop if runiform() < .1
                  save "dataset_A.dta", replace
                  
                  clear
                  set obs 5
                  gen region = _n
                  expand 5
                  bysort region: gen commune = _n
                  expand 3
                  bysort region commune: gen village = _n
                  drop if runiform() < .2
                  expand 7
                  bysort region commune village: gen poll_id = _n
                  save "dataset_B.dta", replace
                  
                  * reduce dataset A and B to one observation per group and append
                  use region commune village using "dataset_A.dta", clear
                  bysort region commune village: keep if _n == 1
                  save "dataset_A1.dta", replace
                  use region commune village using "dataset_B.dta", clear
                  bysort region commune village: keep if _n == 1
                  
                  * combine observations that identify groups from both A and B
                  append using "dataset_A1.dta"
                  
                  * remove duplicates and assign a final identifier
                  bysort region commune village: keep if _n == 1
                  gen long group_id = _n
                  save "region_commune_village_id.dta", replace
                  
                  * return to original data and merge with new group identifiers
                  use "dataset_A.dta", clear
                  merge 1:1 region commune village using "region_commune_village_id.dta", ///
                  assert(match using) keep(match) nogen
                  
                  use "dataset_B.dta", clear
                  merge m:1 region commune village using "region_commune_village_id.dta", ///
                  assert(match using) keep(match) nogen
                  Hey Robert, I understand your approach now, and I think it is brilliant.

                  I have a follow up question:

                  What if I only want groups that are included in A have IDs?

                  I don't want to assign identifiers to groups that are only included in B.

                  Can I assign unique ids to A based on 3 vars, and than merge this list (name egion_commune_village_id.dta) of identifiers to both files, using your code:

                  "use "dataset_A.dta", clear
                  merge 1:1 region commune village using "region_commune_village_id.dta", ///
                  assert(match using) keep(match) nogen

                  use "dataset_B.dta", clear
                  merge m:1 region commune village using "region_commune_village_id.dta", ///
                  assert(match using) keep(match) nogen"


                  Thanks!
                  Last edited by Xiao Roland Tan; 09 Sep 2017, 13:58.

                  Comment


                  • #10
                    I've skimmed over this thread again and I'm confused. If you only want combinations that are in A to have IDs, then create the IDs in A and merge them with B:

                    Code:
                    * verify that dataset A has one observation per group
                    use "dataset_A.dta", clear
                    isid region commune village, sort
                    
                    * generate an identifier
                    gen long group_id = _n
                    
                    merge 1:m region commune village using "dataset_B.dta"
                    tab _merge
                    The _merge variable will show which observations comes from A only, B only, or from both.

                    Comment


                    • #11
                      Originally posted by Robert Picard View Post
                      I've skimmed over this thread again and I'm confused. If you only want combinations that are in A to have IDs, then create the IDs in A and merge them with B:

                      Code:
                      * verify that dataset A has one observation per group
                      use "dataset_A.dta", clear
                      isid region commune village, sort
                      
                      * generate an identifier
                      gen long group_id = _n
                      
                      merge 1:m region commune village using "dataset_B.dta"
                      tab _merge
                      The _merge variable will show which observations comes from A only, B only, or from both.
                      Thank you Robert!!!

                      Comment

                      Working...
                      X