Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two panel datasets with uncommon IDs

    Dear Stata users.

    Please, I need your help with this question. I have two panel datasets, A &B.

    Panel A is the master dataset and contain all variables and ID's that I need in the study.

    Panel B is the secondary dataset, that contains all of the ID's and variables that I need in the study, but they are not sequential. Panel B also contains many IDs that I do not need to have in my master file. Since the dataset is so, large, I cannot sort them one after another before merging the two datasets.

    I want to match these IDs in panel A with B to obtain a dataset that retains every information in panel A, but only the required IDs from B in the master dataset.

    I tried
    merge 1:m ID time using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta"
    . The data merges, but mismatches the IDs as it does so sequentially.
    I tried
    joinby ID time using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta", unmatched(master)
    , but there are similar mismatches with IDs. Someone (William Lisowski ) suggested this
    HTML Code:
    Merging two sets of panel data - Statalist
    but it didn't work for me.

    Please, can you advise on how I can proceed?
    Thank you in advance.

  • #2
    Are the IDs the same across the datasets? Are there variables in B that you need to match to A? Or are these two seperate panels measuring the same sort of thing?

    Comment


    • #3
      Originally posted by George Ford View Post
      Are the IDs the same across the datasets? Are there variables in B that you need to match to A? Or are these two seperate panels measuring the same sort of thing?
      Thank you for replying, George. No, the IDs are not the same. These are two separate panels measuring the same thing. Is it possible for the merge command to search and match based on ID names? If so, hope you can advise.

      Comment


      • #4
        Your question is unclear. If you merge dataset A and dataset B, the idea is that there is some overlap in the IDs in A and B, but B contains additional variables not present in A. Or you want to update some variable values in A with those from B. On the other hand, if A and B contain the same set of variables and there is an overlap in the IDs, but these are distinct datasets (say A refers to time periods 1-2 and B refers to time periods 3-4), then you do not need merge. You need append. You can mark observations in A and keep observations in B if there is a corresponding observation in A.

        Code:
        use master, clear
        gen inA=1
        append using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta"
        bys ID (inA): replace inA= inA[1]
        keep if inA==1
        Last edited by Andrew Musau; 18 Dec 2023, 11:30.

        Comment


        • #5
          That what I was thinking. What Andrew said -- append.

          Comment


          • #6
            Originally posted by Andrew Musau View Post
            Your question is unclear. If you merge dataset A and dataset B, the idea is that there is some overlap in the IDs in A and B, but B contains additional variables not present in A. Or you want to update some variable values in A with those from B. On the other hand, if A and B contain the same set of variables and there is an overlap in the IDs, but these are distinct datasets (say A refers to time periods 1-2 and B refers to time periods 3-4), then you do not need merge. You need append. You can mark observations in A and keep observations in B if there is a corresponding observation in A.

            Code:
            use master, clear
            gen inA=1
            append using "C:\Users\Beri Parfait\Desktop\Data\HDI_v2.dta"
            bys ID (inA): replace inA= inA[1]
            keep if inA==1
            Thank you, Andrew and George, for your help, but the code did not work for me.

            Regarding the clarity of the question, Panel B contains additional IDs (there are more than 80 countries in panel B) that are not needed in the study. Once I declare the two datasets as panels, A and B obviously have different identity numbers. The idea was to extract those countries that are need from Panel B and match them with those in Panel A. I still had a situation where Zimbabwe was being matched with Hongkong, and many more. Because of time, constraints, I had to return to excel.

            Comment


            • #7
              Given your additional information, it sounds like you want to merge rather than append. Declaring your datasets as panel should not change any identifiers. If you give us small samples of your datasets (using the dataex command) so we could see the nature of the identifying variables and the structure of the datasets it would be much easier to advise. I've never done matching in Excel, but I imagine the procedure is much the same as it is in Stata: you must identify the variables which can be used to match observations across datasets.
              Devra Golbe
              Professor Emerita, Dept. of Economics
              Hunter College, CUNY

              Comment


              • #8
                It’s an append. You’ve got to figure out how to do it cleanly. All variables must have the same name across datasets. The panel nature is irrelevant prior to append, as you’re just adding observations. Open B and drop countries you don’t want and use that.

                create new ID. egen group will do. Add 10000 or some other constant to the egen group ID to B so you don’t have same iD across the 2.

                is there any chance you have the same respondents across the two? If si, then duplicates tag/drop to clear that.
                Last edited by George Ford; 18 Dec 2023, 18:13.

                Comment


                • #9
                  A possibility is that you encoded the identifiers prior to merging/ appending. You should have string identifiers.

                  Code:
                  decode ID, gen(id)
                  drop ID
                  save master, replace
                  Then use "id" as an identifier across the datasets. At the end, the best way to understand your problem is to provide samples of your datasets as Devra suggests. You do this e.g. by copying and pasting the results of

                  Code:
                  dataex in 1/20
                  for the master and using files.

                  Comment


                  • #10
                    Thank you for your feedback. while I decode the identifiers, these are the samples of my datasets:
                    Panel A:
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input long ID int time float(adjustedsavingseducationexpendit schoolenrollmentprimarygrosssepr schoolenrollmentprimaryfemalegro schoolenrollmentprimarymalegross)
                    1 1990 1.9534476  34.57724 23.97939  44.77149
                    1 1991 1.9976286  36.11692 24.93005  46.87694
                    1 1992 2.0418096         .        .         .
                    1 1993 2.0859904  36.47658 18.24781  54.01172
                    1 1994 2.1301715   47.7861 30.87191  64.05412
                    1 1995 2.1743524  49.69833 32.46923  66.26698
                    1 1996 2.2185333         .        .         .
                    1 1997 2.2627144         .        .         .
                    1 1998 2.3068953  34.29688        .         .
                    1 1999 2.3510761  27.29885  4.08203  49.57469
                    1 2000  2.395257  22.16299        0  43.41481
                    1 2001  2.439438  22.90859        0  44.86144
                    1 2002  2.483619  75.95975 46.85711 103.83576
                    1 2003    2.5278  96.55368 68.62765 123.28975
                    1 2004  2.571981 106.25266 63.29473 147.35191
                    1 2005  2.616162  100.2297 73.18865 126.07778
                    1 2006  2.660343 103.93224 79.21624 127.53352
                    1 2007  2.704524 101.08646 76.40813  124.6185
                    1 2008  2.748705 103.41396 80.11948 125.58845
                    1 2009  2.792886  99.36972  78.6424 119.07595
                    end
                    format %ty time
                    label values ID ID
                    label def ID 1 "Afghanistan", modify

                    Panel B
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input str38 country int time float humandevelopmentindex long ID
                    "Afghanistan" 1990 .273 1
                    "Afghanistan" 1991 .279 1
                    "Afghanistan" 1992 .287 1
                    "Afghanistan" 1993 .297 1
                    "Afghanistan" 1994 .292 1
                    "Afghanistan" 1995  .31 1
                    "Afghanistan" 1996 .319 1
                    "Afghanistan" 1997 .323 1
                    "Afghanistan" 1998 .324 1
                    "Afghanistan" 1999 .332 1
                    "Afghanistan" 2000 .335 1
                    "Afghanistan" 2001 .337 1
                    "Afghanistan" 2002 .362 1
                    "Afghanistan" 2003 .376 1
                    "Afghanistan" 2004 .392 1
                    "Afghanistan" 2005   .4 1
                    "Afghanistan" 2006 .409 1
                    "Afghanistan" 2007 .424 1
                    "Afghanistan" 2008  .43 1
                    "Afghanistan" 2009  .44 1
                    end
                    format %ty time
                    label values ID ID
                    label def ID 1 "Afghanistan", modify

                    Comment


                    • #11
                      Always apply encode to the combined dataset, not prior to merging.

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input str38 country int time float humandevelopmentindex long ID
                      "Afghanistan" 1990 .273 1
                      "Afghanistan" 1991 .279 1
                      "Afghanistan" 1992 .287 1
                      "Afghanistan" 1993 .297 1
                      "Afghanistan" 1994 .292 1
                      "Afghanistan" 1995  .31 1
                      "Afghanistan" 1996 .319 1
                      "Afghanistan" 1997 .323 1
                      "Afghanistan" 1998 .324 1
                      "Afghanistan" 1999 .332 1
                      "Afghanistan" 2000 .335 1
                      "Afghanistan" 2001 .337 1
                      "Afghanistan" 2002 .362 1
                      "Afghanistan" 2003 .376 1
                      "Afghanistan" 2004 .392 1
                      "Afghanistan" 2005   .4 1
                      "Afghanistan" 2006 .409 1
                      "Afghanistan" 2007 .424 1
                      "Afghanistan" 2008  .43 1
                      "Afghanistan" 2009  .44 1
                      end
                      format %ty time
                      label values ID ID
                      label def ID 1 "Afghanistan", modify
                      
                      drop ID
                      tempfile using
                      save `using', replace
                      
                      
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input long ID int time float(adjustedsavingseducationexpendit schoolenrollmentprimarygrosssepr schoolenrollmentprimaryfemalegro schoolenrollmentprimarymalegross)
                      1 1990 1.9534476  34.57724 23.97939  44.77149
                      1 1991 1.9976286  36.11692 24.93005  46.87694
                      1 1992 2.0418096         .        .         .
                      1 1993 2.0859904  36.47658 18.24781  54.01172
                      1 1994 2.1301715   47.7861 30.87191  64.05412
                      1 1995 2.1743524  49.69833 32.46923  66.26698
                      1 1996 2.2185333         .        .         .
                      1 1997 2.2627144         .        .         .
                      1 1998 2.3068953  34.29688        .         .
                      1 1999 2.3510761  27.29885  4.08203  49.57469
                      1 2000  2.395257  22.16299        0  43.41481
                      1 2001  2.439438  22.90859        0  44.86144
                      1 2002  2.483619  75.95975 46.85711 103.83576
                      1 2003    2.5278  96.55368 68.62765 123.28975
                      1 2004  2.571981 106.25266 63.29473 147.35191
                      1 2005  2.616162  100.2297 73.18865 126.07778
                      1 2006  2.660343 103.93224 79.21624 127.53352
                      1 2007  2.704524 101.08646 76.40813  124.6185
                      1 2008  2.748705 103.41396 80.11948 125.58845
                      1 2009  2.792886  99.36972  78.6424 119.07595
                      end
                      format %ty time
                      label values ID ID
                      label def ID 1 "Afghanistan", modify
                      
                      decode ID, gen(country)
                      drop ID
                      merge 1:1 country time using `using', keep(master match) nogen
                      Res.:

                      Code:
                      . order country time
                      
                      . l, sepby(country)
                      
                           +---------------------------------------------------------------------------+
                           |     country   time   adjust~t   school~r   school~o   school~s   humand~x |
                           |---------------------------------------------------------------------------|
                        1. | Afghanistan   1990   1.953448   34.57724   23.97939   44.77149       .273 |
                        2. | Afghanistan   1991   1.997629   36.11692   24.93005   46.87694       .279 |
                        3. | Afghanistan   1992    2.04181          .          .          .       .287 |
                        4. | Afghanistan   1993    2.08599   36.47658   18.24781   54.01172       .297 |
                        5. | Afghanistan   1994   2.130172    47.7861   30.87191   64.05412       .292 |
                        6. | Afghanistan   1995   2.174352   49.69833   32.46923   66.26698        .31 |
                        7. | Afghanistan   1996   2.218533          .          .          .       .319 |
                        8. | Afghanistan   1997   2.262714          .          .          .       .323 |
                        9. | Afghanistan   1998   2.306895   34.29688          .          .       .324 |
                       10. | Afghanistan   1999   2.351076   27.29885    4.08203   49.57469       .332 |
                       11. | Afghanistan   2000   2.395257   22.16299          0   43.41481       .335 |
                       12. | Afghanistan   2001   2.439438   22.90859          0   44.86144       .337 |
                       13. | Afghanistan   2002   2.483619   75.95975   46.85711   103.8358       .362 |
                       14. | Afghanistan   2003     2.5278   96.55368   68.62765   123.2897       .376 |
                       15. | Afghanistan   2004   2.571981   106.2527   63.29473   147.3519       .392 |
                       16. | Afghanistan   2005   2.616162   100.2297   73.18865   126.0778         .4 |
                       17. | Afghanistan   2006   2.660343   103.9322   79.21624   127.5335       .409 |
                       18. | Afghanistan   2007   2.704524   101.0865   76.40813   124.6185       .424 |
                       19. | Afghanistan   2008   2.748705    103.414   80.11948   125.5884        .43 |
                       20. | Afghanistan   2009   2.792886   99.36972    78.6424    119.076        .44 |
                           +---------------------------------------------------------------------------+

                      Comment


                      • #12
                        Ah, so it was a merge.

                        Comment


                        • #13
                          Thank you very much, Andrew. I finally got it right with your code. I appreciate.

                          Comment

                          Working...
                          X