How do I merge data files with multiple identifiers that are different?

Francis Dompae

Join Date: Nov 2016

Posts: 7
#1

How do I merge data files with multiple identifiers that are different?

16 Nov 2016, 05:48

Dear Statalisters,

I am a new stata user currently working on a socioeconomic dataset of about 5,000 households, and need help to merge some of the data. I am trying to merge 5 files, each with two variables as unique identifiers, jointly. One of the two identifiers in each file (hhno – household id) is common across all the 5 files but the rest are different. The files and their identifiers are as follows:
No File Identifiers

1 Land data hhno plotno

2 Finance data hhno finc_no

3 Asset data hhno dasset

4 Tools data hhno ftool

5 Extension data hhno org_id

I am doing plot level analysis which duplicates hhno. Except m:m merge, the other more assured merge types predictably return the error “variable not found”, respectively referring to the data item identifiers. As the m:m merge output (matches) does not look convincing, I tried using –joinby– whose output does not look convincing either. What I always expect to match in some cases don’t. Next, I tried grouping the 2 unique identifiers in each file using ege, group() concatenation command but doubt if that is reasonable and/or appropriate for my purpose.

If I can be sure that using this -egen, group()- command to concatenate the pairs of each file into 1 unique id will be centered on the hhno, I can proceed with the merging, but I don’t get that impression in both reading and experimenting with that. What I want to do (merging) seems basic but I am stuck, and hope and look forward to your kind assistance guiding me on how to merge the data.

Thank you!
Francis
Tags: None
Chrisk Burningham

Join Date: Oct 2016

Posts: 35
#2

16 Nov 2016, 06:22

Hi Francis, its not completely clear from the table what you need doing.

However at a guess, one thing to try is to create a new variable that concatenates the two identifiers:

Code:

egen newvar = concat(identifier1 identifier2), punct(" ") duplicates list new var

The merge requires completely unique identifiers, otherwise stata would not know what to merge. Hopefully when you list duplicates there should be 0 and you can repeat the process for the other file you wish to merge.
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#3

16 Nov 2016, 07:06

This question could be made a lot clearer if you would provide more detail on the variables in each set (or two of them, to see the logic). Please provide data examples with dataex (see the faq).
Pending more info, there's a number of possible guesses. Its possible that you'd need to merge on hhid only.
Another option is that you aren't seeking to merge, but to append (more in this link)
It's also possible that some form of reshaping individual datasets is needed before a merge is sensible.
Comment

Francis Dompae

Join Date: Nov 2016
Posts: 7

16 Nov 2016, 14:40

Thank you very much Chrisk and Jorrit for the quick and helpful comments! Sorry for the lack of clarity in the specification of my problem but I think your guesses have pointed to the solution. I had already appended some of the files which are captured and stored by 2 different farming seasons with same questions but capturing different households. So, I experimented with the code and reshaping the data as suggested. The files are in long format so I reshaped to wide and hhno becomes unique for the merging but that means I have to reshape again to long upon a successful merging. Samples of the Tools and Asset data which I just experimented with are as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long hhno byte(ftool numtool) float(tiltool spraymach)
101001002  4 1 0 0
101001002 14 1 0 0
101001003  4 3 0 0
101001003  5 7 0 0
101001003  6 2 0 0
101001003  8 2 0 0
101001003 14 3 0 0
101001004  4 2 0 0
101001004  9 3 0 0
101001004 14 1 0 0
end
label values ftool s3aii_0
label def s3aii_0 4 "Hoe", modify
label def s3aii_0 5 "Axe", modify
label def s3aii_0 6 "Rake", modify
label def s3aii_0 8 "Pick axe", modify
label def s3aii_0 9 "Sickle/reaping hook", modify
label def s3aii_0 14 "Cutlass", modify
label values tiltool tiltool
label def tiltool 0 "No", modify
label values spraymach spraymach
label def spraymach 0 "No", modify

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long hhno byte dasset int numgood byte gowner1
101001002 3 1 1
101001002 4 1 1
101001003 1 1 1
101001003 4 2 1
101001003 5 1 1
101001004 1 3 1
101001004 3 1 1
101001004 4 1 1
101001009 3 1 1
101001009 4 1 .
end
label values dasset dasset
label def dasset 1 "Non-farm land", modify
label def dasset 3 "Mobile phone", modify
label def dasset 4 "Radio", modify
label def dasset 5 "Television", modify
label values gowner1 s3aiii_b1

Your support is highly appreciated and I thank you once again!
Francis

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#5

17 Nov 2016, 01:00

Good to hear the suggestions helped.
Now I'm no sure is this problem is solved or if you are still stuck somewhere in the process? Would you still need to merge the datasets provided? It does sound like you figured out which steps to take now.
Comment
Francis Dompae

Join Date: Nov 2016

Posts: 7
#6

17 Nov 2016, 17:01

Thank you Jorrit for following up on my issue. I am still a bit stuck in the process of merging the data.

Yes I still need to merge the datasets provided and have been trying to do that. I reshaped the files to wide, merged 1:1 using hhno (keeping all 3 matched types) and reshaped back to long, but I am not confident with the output because I am not quite sure of the commands I used. I run the following commands/outputs in the process:

. use Durablegoods, clear
. reshape wide numgood gowner1 valgood, i(hhno) j(dasset)
(note: j = 1 2 3 4 5 6)

Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 9749 -> 4261
Number of variables 7 -> 21
j variable (6 values) dasset -> (dropped)
xij variables:
numgood -> numgood1 numgood2 ... numgood6
gowner1 -> gowner11 gowner12 ... gowner16
valgood -> valgood1 valgood2 ... valgood6
-----------------------------------------------------------------------------

. save Durablegoods_wide,replace
file Durablegoods_wide.dta saved

. use Farmtools, clear
. reshape wide numtool towner1 valtool tiltool cuttool engtool axtool spraymach cartool, i(hhno) j(ftool
> )
(note: j = 1 2 4 5 6 8 9 10 11 14 15 16 19)

Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 9569 -> 3839
Number of variables 14 -> 121
j variable (13 values) ftool -> (dropped)
xij variables:
numtool -> numtool1 numtool2 ... numtool19
towner1 -> towner11 towner12 ... towner119
valtool -> valtool1 valtool2 ... valtool19
tiltool -> tiltool1 tiltool2 ... tiltool19
cuttool -> cuttool1 cuttool2 ... cuttool19
engtool -> engtool1 engtool2 ... engtool19
axtool -> axtool1 axtool2 ... axtool19
spraymach -> spraymach1 spraymach2 ... spraymach19
cartool -> cartool1 cartool2 ... cartool19
-----------------------------------------------------------------------------

. save Farmtools_wide, replace
file Farmtools_wide.dta saved

. merge 1:1 hhno using Durablegoods_wide
(label id1 already defined)

Result # of obs.
-----------------------------------------
not matched 1,558
from master 568 (_merge==1)
from using 990 (_merge==2)

matched 3,271 (_merge==3)
-----------------------------------------

. reshape long numtool towner1 valtool tiltool cuttool engtool axtool spraymach cartool numgood gowner1
> valgood, i(hhno) j(ftool)
(note: j = 1 2 3 4 5 6 8 9 10 11 14 15 16 19)

Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4829 -> 67606
Number of variables 141 -> 19
j variable (14 values) -> ftool
xij variables:
numtool1 numtool2 ... numtool19 -> numtool
towner11 towner12 ... towner119 -> towner1
valtool1 valtool2 ... valtool19 -> valtool
tiltool1 tiltool2 ... tiltool19 -> tiltool
cuttool1 cuttool2 ... cuttool19 -> cuttool
engtool1 engtool2 ... engtool19 -> engtool
axtool1 axtool2 ... axtool19 -> axtool
spraymach1 spraymach2 ... spraymach19 -> spraymach
cartool1 cartool2 ... cartool19 -> cartool
numgood1 numgood2 ... numgood19 -> numgood
gowner11 gowner12 ... gowner119 -> gowner1
valgood1 valgood2 ... valgood19 -> valgood
-----------------------------------------------------------------------------

What should I do so as not to lose the secondary identifiers (ftool & dasset) as the first conversion to wide drops them and the re-conversion to long after merging distorts the j(ftool) I specified. I also observed missing values that seem arbitrary in some variables. Where did I go wrong and what would you suggest I do differently? I look forward to you guidance once again.

Thank you!
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#7

18 Nov 2016, 02:45

I guess the issue you refer to with renaming with the reconversion to long is because you have numeric identifiers from each dataset that overlap. That is, each runs from 1-x, and after merging there is no longer a unique connection between numbers and labels. That could be fixed by e.g., adding 100 to all values for item codes in dataset for tools, 200 in the durable goods etc. Another option would be to replace numerical values with the value label, in a string format (decode, link).

More important, however, is what kind of datashape you are actually trying to end up with. The current datasets are laid out in a way that is more suited for working in a spreadsheet.

You could keep your datasets wide after merging. That way you'd have individual variables for each of the numbers of tools etc that each household has. You'd have to figure out what you'd want to do with each tool (or other item's) properties such as 'is it a tiltool', 'value' etc. Do you want to keep individual variables of all properties for each item, or make compound variables, such as total number of tiltool, total value etc, for each household?. This datashape would make most sense with most analyses done in Stata, although, unlike a spreadheet format, it's rather difficult to read

You could also append, and before you do so, rename variables in each dataset. Example: ftool and dasset could both be renamed 'item', wheeras numgood and numtool could be 'number'.
This approach would make you end up with something more like what I think you are trying to achieve. This would also be closer to the layout of the individual datasets, but that format is not particularly suited to much analyses I could think of in Stata.

To help you think about this: suppose you'd want to find a relation between how much land a farmer has versus other property. With the wide layout, you could regress land ownership versus number of telephones, number of TV's etc, or total value of durable goods. In the long layout, you could only regress e.g., number of televions versus value of televisions, or more to the point, you'd be limited to regressing properties for individual item categories against each other. Probably not what you want.
Comment
Francis Dompae

Join Date: Nov 2016

Posts: 7
#8

19 Nov 2016, 21:01

Hi Jorrit,

Thanks for sharing the detailed comments which triggered a deeper reflection of what I am seeking to do, and eventually got me late in getting back with a response. Apologies for that!

I am setting up the data to do a household level analysis but for each crop. It would be inappropriate and complicated to have all plots or crops put together at the household level because a good number of the households have multiple plots and produce 2 or more of 4 crops that I am investigating - Maize, Millet, Rice and Sorghum. Some of these households also produce the same crop in both major and minor seasons. So I tagged the observations at household-plot-crop level by identifying the households that cultivated any of the crops on at least one plot in at least one season.

So, after considering the issues you flagged and looking closely at the dta, it will be easier for me to keep the data in wide form with its less merging complexities. As to whether that will suit the analysis I want to do, I am not so sure about that now especially after your comments. For this reason, and pending additional suggestions, I will like to merge the data and keep separate files in both the wide and long formats so I can compare results of the analysis to see which is appropriate. Luckily, the code that Chrisk provided has proved successful in terms of the outlook of the merged data and what I wanted and originally described, even though I don't completely understand how stata worked with it in the merging. My fear is that tagging the cases the way I did may be complicating the merging without a trace if I am not careful.

If I can get the data shape type right, it will be helpful and I can now carry out the appropriate merge. Thanks again for the helpful ideas that you continue share on this.

Francis
Comment
Francis Dompae

Join Date: Nov 2016

Posts: 7
#9

19 Nov 2016, 21:35

Hi Chrisk,

I revisited the code you suggested to examine the output in detail to understand better how it works. Given that the new variable I create (ident ) is a product of the two identifiers in each of the 5 data files which are different in terms of the pairing combinations, won't that result in some distortions in the eventual merged data? This was the initial impression I got running the code the first time when I did not get zero from the duplicates list of ident.

This time round, I noted that hhno and ident identify the observations, but ftool and ident do not. Does this mean that the command standardizes the two original identifiers in each of the paired case with respect to hhno? The two original unique identifiers (e.g hhno ftool) also still identify the data in each dataset. Since hhno and ident identify the data, I repeated the process as you suggested using the same name ident in all the 5 files to have it in all of them. I then used it alongside hhno to merge all the files.

Was this what you meant I do? That was what I meant in the initial posting when I referred to proceeding with the merging if I can be assured that concatenating the 2-paired identifiers of each file into 1 unique identifier will be centered on hhno.

The merging went well seemingly and looks straight forward but I cannot figure out how stata did it with the involvement of ident whose values look identical across many observations. Any additional notes on the code, especially how the option punct (" ") works beyond will be very helpful and educating.

Thank you!
Francis
Comment

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

#10

20 Nov 2016, 02:09

Are you sure this is what you need?
The code Chris provided would have worked out if both identifiers are the same across all datasets, but not here. Chris' also suggested to use the newly created identifier variable to merge, not the new id var plus one of the old ones. Stata will merge it in such a way, but the result wont be terribly useful.
With the data examples you provided, for example, you will end up with a number of lines (rows) per household. The new ident value 101001002 4 will give one row with the number of radios and the number of hoes for hh 101001002, whereas ident 101001002 5 will be a row with info on televisions and axes. The same logic will apply to different farmland types.
With such a datashape, you would be able to correlate or regress farmland type 4 on the number of radios and hoes, but not on the number of any other tools or durable goods, as those values will be missing in those rows.
For almost any analyses I can think of in Stata, you would want your data wide here. One line per household only. If you do not want to work with compound variables such as all farmland, than that's fine, keep the info on all farmland, tool, goods type etc. Still, it'd have to be one line per household. The only exception is if you have repeated observations for individual households over time (if the survey is repeated multiple years).

The code Chris concatenates the values of the two variables (more in this link). The punct(" ") option tells Stata to include a single space between values of the original variables.

Maybe it helps here to show what the merge you have done results in, using only the example datasets you provided. Using such smaller sets is usually easier to see what you are doing than working on the complete datasets right from the start. Is this what you wanted?

Code:

. list hhno dasset numgood gowner1 newvar ftool numtool tiltool spraymach

     +------------------------------------------------------------------------------------------------------------------+
     |      hhno          dasset   numgood   gowner1        newvar                 ftool   numtool   tiltool   spraym~h |
     |------------------------------------------------------------------------------------------------------------------|
  1. | 101001002    Mobile phone         1         1    1.01e+08 3                     .         .         .          . |
  2. | 101001002           Radio         1         1    1.01e+08 4                   Hoe         1        No         No |
  3. | 101001003   Non-farm land         1         1    1.01e+08 1                     .         .         .          . |
  4. | 101001003           Radio         2         1    1.01e+08 4                   Hoe         3        No         No |
  5. | 101001003      Television         1         1    1.01e+08 5                   Axe         7        No         No |
     |------------------------------------------------------------------------------------------------------------------|
  6. | 101001004   Non-farm land         3         1    1.01e+08 1                     .         .         .          . |
  7. | 101001004    Mobile phone         1         1    1.01e+08 3                     .         .         .          . |
  8. | 101001004           Radio         1         1    1.01e+08 4                   Hoe         2        No         No |
  9. | 101001009    Mobile phone         1         1    1.01e+08 3                     .         .         .          . |
 10. | 101001009           Radio         1         .    1.01e+08 4                     .         .         .          . |
     |------------------------------------------------------------------------------------------------------------------|
 11. | 101001002               .         .         .   1.01e+08 14               Cutlass         1        No         No |
 12. | 101001003               .         .         .   1.01e+08 14               Cutlass         3        No         No |
 13. | 101001003               .         .         .    1.01e+08 6                  Rake         2        No         No |
 14. | 101001003               .         .         .    1.01e+08 8              Pick axe         2        No         No |
 15. | 101001004               .         .         .   1.01e+08 14               Cutlass         1        No         No |
     |------------------------------------------------------------------------------------------------------------------|
 16. | 101001004               .         .         .    1.01e+08 9   Sickle/reaping hook         3        No         No |
     +------------------------------------------------------------------------------------------------------------------+

Comment

Francis Dompae

Join Date: Nov 2016

Posts: 7
#11

20 Nov 2016, 19:50

Hi Jorrit,

Yes, originally I thought this was the shape stata files for analysis should be, perhaps influenced into thinking that way because I accessed the data in the long format. I now understand the logic as the data came in long shape presumably because the study for which the data was collected is aimed at a panel but what I am using is the baseline/wave 1. With this understanding now from your detailed explanations and reassessment of the data, I think I am fine now and will reshape to wide and work with that.

Thank you for the exposition of the code, the many reference links you introduced me to, the patience and the great amount of time you spent in assisting me find a way out!

Francis
Comment
Flozzy Opondo

Join Date: Nov 2016

Posts: 3
#12

21 Nov 2016, 02:46

Hi, i have data on income earned from various agricultural output and i would like to analyze the variations of income earned from those products.Which is the most appropriate model that will give me robust results?
Thank you.
Flozzy
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#13

21 Nov 2016, 02:57

Hi Flozzy,
please start a new thread for new questions. Also provide more details on your data, e.e.g, if it has repeated observations over time and/or different geographical regions.
Comment
Flozzy Opondo

Join Date: Nov 2016

Posts: 3
#14

21 Nov 2016, 03:48

Thanks Jorrit for the quick response, though your first comment on starting new thread isn't clear. N'way as i figure out that may be some more clarification on the data.It is a cross sectional data collected from two different regions ( Counties) between May to June 2016, though the information is for the last production year that is 2015. The main crops of comparisons include, cassava , maize , beans, sorghum and sweet potatoes.I'll appreciate your indulgence.
Flozzy
Comment
Francis Dompae

Join Date: Nov 2016

Posts: 7
#15

21 Nov 2016, 04:37

Hi Flozzy,

What Jorrit meant is that your post is a reply to someone else question posted for discussion and being discussed, which is Francis Dompae in this case. Rather than post your problem this way, post it as a new question independent of a discussion thread of an existing post. Please, read the FAQ and you will be able to follow and re-post your problem for help.

Thank you!
Comment

Announcement