Merging two datasets

claire macfarlane

Join Date: Oct 2020

Posts: 17
#1

Merging two datasets

28 Oct 2020, 03:45

Hello! I am working with survey data and want to merge two datasets. One dataset has biographical information and the other employment information (master) on individuals. According to the survey website, both datasets have the identifiers pid (personal id) and syear (survey year). I am, however, unsure whether to do a 1:1, m:1, or 1:m merge. I have tried all three ways, and no matter how I do it there are 566,044 unmatched observations, overwhelmingly from my master dataset, which seems extremely large. I have run a duplicates report of those sharing pid and syear for both datasets and there are no duplicates that share the two identifiers in either.

Code:

merge m:1 pid syear using "C:\Users\zxoqi13\Downloads\biol.dta"

What could this mean? Is it ok that there are so many unmatched observations? What would you advise I look at (commands I could run) to determine what type of merge is appropriate?

Thank you!
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#2

28 Oct 2020, 05:51

Do you get any matches? In any case, if there is more than one observation per identifier in one dataset and you proceed with a 1:1 merge, then Stata will terminate with an error. My guess is that you are fine.

What could this mean? Is it ok that there are so many unmatched observations?

Well, you know your sample best. It is not unreasonable to have unmatched observations if some proportion of survey respondents are not of working age. Others may be of working age but may be unemployed or pursuing education and therefore, they will not have employment records during the observed years.
1 like
Comment
claire macfarlane

Join Date: Oct 2020

Posts: 17
#3

28 Oct 2020, 10:12

Yes I still got ten's of thousands of matches. My concern however is that most of the un-matches were from my master dataset, not the demographic (using) dataset. If they were from the using dataset, I would under stable as you said that some are not of working age or are unemployed.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

28 Oct 2020, 10:27

In a situation like this, I'd browse the data for some of the non-matches. That is the best way to determine what's happening, and will likely lead to an "Aha" experience. Figuring out these problems in the abstract is hard, and looking at actual instances helps a lot.

But thinking in the abstract anyway, here's one simple thought: If the bio information doesn't vary over time (as is sometimes true in situations like yours), or only shows up if there is a change, then including survey year as one of the merge key variables could be wrong, and might be the cause of the match failures. My apologies if this guess is off-target.
2 likes
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10214
#5

28 Oct 2020, 10:28

If some of the workers are seasonal (non-resident), then you may have their employment records but not their demographic details. You could contact the data providers and ask. As far as I tell, this is not an issue with Stata.
Comment
claire macfarlane

Join Date: Oct 2020

Posts: 17
#6

29 Oct 2020, 09:12

Originally posted by Mike Lacy View Post

In a situation like this, I'd browse the data for some of the non-matches. That is the best way to determine what's happening, and will likely lead to an "Aha" experience. Figuring out these problems in the abstract is hard, and looking at actual instances helps a lot.

But thinking in the abstract anyway, here's one simple thought: If the bio information doesn't vary over time (as is sometimes true in situations like yours), or only shows up if there is a change, then including survey year as one of the merge key variables could be wrong, and might be the cause of the match failures. My apologies if this guess is off-target.

Thank you both. @Mikelacy when you say browsing the data, how would I go about doing that? I am a bit of a novice so although I am familiar with the browse command I am unsure how I would apply it here. I will also contact the survey-data providers with my issue to see if they can give me input on the number of unmatched and whether I should be using both identifiers or just pid.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

29 Oct 2020, 11:06

Claire, the secret to helpful use of -browse- is to impose some restrictions with -if- and to focus on a few variables. (More than that is generally just confusing.) You can use use the _merge variable to pick out some mismatches. I'd sort the data to make sure the observations that you want to be matched display contiguously in the data editor, and I'd pick out a few key variables, something like this.

Code:

merge ..... sort pid syear browse pid syear age _merge AFewOtherVariables if (_merge != 3) // observation in master only or using only

Also, I have another thought, which I should have brought up earlier: How many records per individual are there in the biographical info file? Just one? Or multiples in possibly varying numbers over the years? If there are multiple bio records over the years, how are they distinguished (syear and pid??)? I'm now going back to the issue of an 1:m vs. m:1 merge.

I don't think contacting the data providers will be needed. Your problem is more likely a conceptual issue that depends on your purposes and the structure of the data.
2 likes
Comment
claire macfarlane

Join Date: Oct 2020

Posts: 17
#8

31 Oct 2020, 03:20

In sorting and browsing the biographical dataset by pid and syear, there are multiple entries for many individuals, each representing a different survey year. Each person is distinguished by their pid, and then each survey year by syear. Not every respondent has the same amount of duplicates, so I am assuming respondents re-take the biographical survey when something has changed such as getting married or having a child. In this respect, it seems appropriate to use both pid and syear as identifiers, no? Would this be a m:1 merge?

I did a version of what you suggested and I think I had my 'aha' moment as far as why I have so many unmatched in my merge. I did this:

Code:

browse pid syear Age Female _merge if (_merge != 1)

And there are missing ages and genders for all of the respondents whereas in _merge != 3 there are responses for all respondents of either 'does not apply', 'not contained in survey' or an age over 17 (working age). I don't know why there would be missing birthyear and gender and yet they do give answers for other questions such as income or employment status.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10214

31 Oct 2020, 03:44

so I am assuming respondents re-take the biographical survey when something has changed such as getting married or having a child.

If this is the case, you can fill in the missing information after merging. Suppose the following is your master dataset with all years present and the using dataset with only some years present.You merge on both pid and syear and then fill in the last updated values.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(pid syear var2 var3)
1 1935 3078.5   2.8
1 1938 2792.2 209.2
2 1935 1362.4  53.8
2 1938 1801.9 260.2
3 1935 1170.6  97.8
3 1938 2039.7 156.2
4 1935  417.5  10.5
4 1938  437.9  51.8
end
format %ty syear

tempfile using
save `using'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(pid syear var1)
1 1935 317.6
1 1936 391.8
1 1937 410.6
1 1938 257.7
1 1939 330.8
2 1935 209.9
2 1936 355.3
2 1937 469.9
2 1938 262.3
2 1939 230.4
3 1935  33.1
3 1936    45
3 1937  77.2
3 1938  44.6
3 1939  48.1
4 1935 40.29
4 1936 72.76
4 1937 66.26
4 1938  51.6
4 1939 52.41
end
format %ty syear

merge 1:1 pid syear using `using'

In the above, we have 8 matched observations and 12 unmatched from the master dataset.

Code:

 
. merge 1:1 pid syear using `using'

    Result                           # of obs.
    -----------------------------------------
    not matched                            12
        from master                        12  (_merge==1)
        from using                          0  (_merge==2)

    matched                                 8  (_merge==3)
    -----------------------------------------


     +---------------------------------------------------------+
     | pid   syear   invest     var2    var3            _merge |
     |---------------------------------------------------------|
  1. |   1    1935    317.6   3078.5     2.8       matched (3) |
  2. |   1    1936    391.8        .       .   master only (1) |
  3. |   1    1937    410.6        .       .   master only (1) |
  4. |   1    1938    257.7   2792.2   209.2       matched (3) |
  5. |   1    1939    330.8        .       .   master only (1) |
     |---------------------------------------------------------|
  6. |   2    1935    209.9   1362.4    53.8       matched (3) |
  7. |   2    1936    355.3        .       .   master only (1) |
  8. |   2    1937    469.9        .       .   master only (1) |
  9. |   2    1938    262.3   1801.9   260.2       matched (3) |
 10. |   2    1939    230.4        .       .   master only (1) |
     |---------------------------------------------------------|
 11. |   3    1935     33.1   1170.6    97.8       matched (3) |
 12. |   3    1936       45        .       .   master only (1) |
 13. |   3    1937     77.2        .       .   master only (1) |
 14. |   3    1938     44.6   2039.7   156.2       matched (3) |
 15. |   3    1939     48.1        .       .   master only (1) |
     |---------------------------------------------------------|
 16. |   4    1935    40.29    417.5    10.5       matched (3) |
 17. |   4    1936    72.76        .       .   master only (1) |
 18. |   4    1937    66.26        .       .   master only (1) |
 19. |   4    1938     51.6    437.9    51.8       matched (3) |
 20. |   4    1939    52.41        .       .   master only (1) |
     +---------------------------------------------------------+

Because we assume that the variables are invariant in the missing years, you just need to do the following for all variables in the using dataset (i.e., var2 and var3).

Code:

foreach var in var2 var3{
    bys pid (syear): replace `var'=`var'[_n-1] if missing(`var')
}

Res.:

Code:

 +---------------------------------------------------------+
     | pid   syear   invest     var2    var3            _merge |
     |---------------------------------------------------------|
  1. |   1    1935    317.6   3078.5     2.8       matched (3) |
  2. |   1    1936    391.8   3078.5     2.8   master only (1) |
  3. |   1    1937    410.6   3078.5     2.8   master only (1) |
  4. |   1    1938    257.7   2792.2   209.2       matched (3) |
  5. |   1    1939    330.8   2792.2   209.2   master only (1) |
     |---------------------------------------------------------|
  6. |   2    1935    209.9   1362.4    53.8       matched (3) |
  7. |   2    1936    355.3   1362.4    53.8   master only (1) |
  8. |   2    1937    469.9   1362.4    53.8   master only (1) |
  9. |   2    1938    262.3   1801.9   260.2       matched (3) |
 10. |   2    1939    230.4   1801.9   260.2   master only (1) |
     |---------------------------------------------------------|
 11. |   3    1935     33.1   1170.6    97.8       matched (3) |
 12. |   3    1936       45   1170.6    97.8   master only (1) |
 13. |   3    1937     77.2   1170.6    97.8   master only (1) |
 14. |   3    1938     44.6   2039.7   156.2       matched (3) |
 15. |   3    1939     48.1   2039.7   156.2   master only (1) |
     |---------------------------------------------------------|
 16. |   4    1935    40.29    417.5    10.5       matched (3) |
 17. |   4    1936    72.76    417.5    10.5   master only (1) |
 18. |   4    1937    66.26    417.5    10.5   master only (1) |
 19. |   4    1938     51.6    437.9    51.8       matched (3) |
 20. |   4    1939    52.41    437.9    51.8   master only (1) |
     +---------------------------------------------------------+

Announcement

Merging two datasets

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment