Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I merge data files with multiple identifiers that are different?

    Dear Statalisters,

    I am a new stata user currently working on a socioeconomic dataset of about 5,000 households, and need help to merge some of the data. I am trying to merge 5 files, each with two variables as unique identifiers, jointly. One of the two identifiers in each file (hhno – household id) is common across all the 5 files but the rest are different. The files and their identifiers are as follows:
    No File Identifiers
    1 Land data hhno plotno
    2 Finance data hhno finc_no
    3 Asset data hhno dasset
    4 Tools data hhno ftool
    5 Extension data hhno org_id
    I am doing plot level analysis which duplicates hhno. Except m:m merge, the other more assured merge types predictably return the error “variable not found”, respectively referring to the data item identifiers. As the m:m merge output (matches) does not look convincing, I tried using –joinby– whose output does not look convincing either. What I always expect to match in some cases don’t. Next, I tried grouping the 2 unique identifiers in each file using ege, group() concatenation command but doubt if that is reasonable and/or appropriate for my purpose.

    If I can be sure that using this -egen, group()- command to concatenate the pairs of each file into 1 unique id will be centered on the hhno, I can proceed with the merging, but I don’t get that impression in both reading and experimenting with that. What I want to do (merging) seems basic but I am stuck, and hope and look forward to your kind assistance guiding me on how to merge the data.

    Thank you!
    Francis

  • #2
    Hi Francis, its not completely clear from the table what you need doing.

    However at a guess, one thing to try is to create a new variable that concatenates the two identifiers:

    Code:
    egen newvar = concat(identifier1 identifier2), punct(" ")
    duplicates list new var
    The merge requires completely unique identifiers, otherwise stata would not know what to merge. Hopefully when you list duplicates there should be 0 and you can repeat the process for the other file you wish to merge.

    Comment


    • #3
      This question could be made a lot clearer if you would provide more detail on the variables in each set (or two of them, to see the logic). Please provide data examples with dataex (see the faq).
      Pending more info, there's a number of possible guesses. Its possible that you'd need to merge on hhid only.
      Another option is that you aren't seeking to merge, but to append (more in this link)
      It's also possible that some form of reshaping individual datasets is needed before a merge is sensible.

      Comment


      • #4
        Thank you very much Chrisk and Jorrit for the quick and helpful comments! Sorry for the lack of clarity in the specification of my problem but I think your guesses have pointed to the solution. I had already appended some of the files which are captured and stored by 2 different farming seasons with same questions but capturing different households. So, I experimented with the code and reshaping the data as suggested. The files are in long format so I reshaped to wide and hhno becomes unique for the merging but that means I have to reshape again to long upon a successful merging. Samples of the Tools and Asset data which I just experimented with are as follows:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long hhno byte(ftool numtool) float(tiltool spraymach)
        101001002  4 1 0 0
        101001002 14 1 0 0
        101001003  4 3 0 0
        101001003  5 7 0 0
        101001003  6 2 0 0
        101001003  8 2 0 0
        101001003 14 3 0 0
        101001004  4 2 0 0
        101001004  9 3 0 0
        101001004 14 1 0 0
        end
        label values ftool s3aii_0
        label def s3aii_0 4 "Hoe", modify
        label def s3aii_0 5 "Axe", modify
        label def s3aii_0 6 "Rake", modify
        label def s3aii_0 8 "Pick axe", modify
        label def s3aii_0 9 "Sickle/reaping hook", modify
        label def s3aii_0 14 "Cutlass", modify
        label values tiltool tiltool
        label def tiltool 0 "No", modify
        label values spraymach spraymach
        label def spraymach 0 "No", modify

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long hhno byte dasset int numgood byte gowner1
        101001002 3 1 1
        101001002 4 1 1
        101001003 1 1 1
        101001003 4 2 1
        101001003 5 1 1
        101001004 1 3 1
        101001004 3 1 1
        101001004 4 1 1
        101001009 3 1 1
        101001009 4 1 .
        end
        label values dasset dasset
        label def dasset 1 "Non-farm land", modify
        label def dasset 3 "Mobile phone", modify
        label def dasset 4 "Radio", modify
        label def dasset 5 "Television", modify
        label values gowner1 s3aiii_b1
        Your support is highly appreciated and I thank you once again!
        Francis

        Comment


        • #5
          Good to hear the suggestions helped.
          Now I'm no sure is this problem is solved or if you are still stuck somewhere in the process? Would you still need to merge the datasets provided? It does sound like you figured out which steps to take now.

          Comment


          • #6
            Thank you Jorrit for following up on my issue. I am still a bit stuck in the process of merging the data.

            Yes I still need to merge the datasets provided and have been trying to do that. I reshaped the files to wide, merged 1:1 using hhno (keeping all 3 matched types) and reshaped back to long, but I am not confident with the output because I am not quite sure of the commands I used. I run the following commands/outputs in the process:

            . use Durablegoods, clear
            . reshape wide numgood gowner1 valgood, i(hhno) j(dasset)
            (note: j = 1 2 3 4 5 6)

            Data long -> wide
            -----------------------------------------------------------------------------
            Number of obs. 9749 -> 4261
            Number of variables 7 -> 21
            j variable (6 values) dasset -> (dropped)
            xij variables:
            numgood -> numgood1 numgood2 ... numgood6
            gowner1 -> gowner11 gowner12 ... gowner16
            valgood -> valgood1 valgood2 ... valgood6
            -----------------------------------------------------------------------------

            . save Durablegoods_wide,replace
            file Durablegoods_wide.dta saved

            . use Farmtools, clear
            . reshape wide numtool towner1 valtool tiltool cuttool engtool axtool spraymach cartool, i(hhno) j(ftool
            > )
            (note: j = 1 2 4 5 6 8 9 10 11 14 15 16 19)

            Data long -> wide
            -----------------------------------------------------------------------------
            Number of obs. 9569 -> 3839
            Number of variables 14 -> 121
            j variable (13 values) ftool -> (dropped)
            xij variables:
            numtool -> numtool1 numtool2 ... numtool19
            towner1 -> towner11 towner12 ... towner119
            valtool -> valtool1 valtool2 ... valtool19
            tiltool -> tiltool1 tiltool2 ... tiltool19
            cuttool -> cuttool1 cuttool2 ... cuttool19
            engtool -> engtool1 engtool2 ... engtool19
            axtool -> axtool1 axtool2 ... axtool19
            spraymach -> spraymach1 spraymach2 ... spraymach19
            cartool -> cartool1 cartool2 ... cartool19
            -----------------------------------------------------------------------------

            . save Farmtools_wide, replace
            file Farmtools_wide.dta saved

            . merge 1:1 hhno using Durablegoods_wide
            (label id1 already defined)

            Result # of obs.
            -----------------------------------------
            not matched 1,558
            from master 568 (_merge==1)
            from using 990 (_merge==2)

            matched 3,271 (_merge==3)
            -----------------------------------------

            . reshape long numtool towner1 valtool tiltool cuttool engtool axtool spraymach cartool numgood gowner1
            > valgood, i(hhno) j(ftool)
            (note: j = 1 2 3 4 5 6 8 9 10 11 14 15 16 19)

            Data wide -> long
            -----------------------------------------------------------------------------
            Number of obs. 4829 -> 67606
            Number of variables 141 -> 19
            j variable (14 values) -> ftool
            xij variables:
            numtool1 numtool2 ... numtool19 -> numtool
            towner11 towner12 ... towner119 -> towner1
            valtool1 valtool2 ... valtool19 -> valtool
            tiltool1 tiltool2 ... tiltool19 -> tiltool
            cuttool1 cuttool2 ... cuttool19 -> cuttool
            engtool1 engtool2 ... engtool19 -> engtool
            axtool1 axtool2 ... axtool19 -> axtool
            spraymach1 spraymach2 ... spraymach19 -> spraymach
            cartool1 cartool2 ... cartool19 -> cartool
            numgood1 numgood2 ... numgood19 -> numgood
            gowner11 gowner12 ... gowner119 -> gowner1
            valgood1 valgood2 ... valgood19 -> valgood
            -----------------------------------------------------------------------------

            What should I do so as not to lose the secondary identifiers (ftool & dasset) as the first conversion to wide drops them and the re-conversion to long after merging distorts the j(ftool) I specified. I also observed missing values that seem arbitrary in some variables. Where did I go wrong and what would you suggest I do differently? I look forward to you guidance once again.

            Thank you!

            Comment


            • #7
              I guess the issue you refer to with renaming with the reconversion to long is because you have numeric identifiers from each dataset that overlap. That is, each runs from 1-x, and after merging there is no longer a unique connection between numbers and labels. That could be fixed by e.g., adding 100 to all values for item codes in dataset for tools, 200 in the durable goods etc. Another option would be to replace numerical values with the value label, in a string format (decode, link).

              More important, however, is what kind of datashape you are actually trying to end up with. The current datasets are laid out in a way that is more suited for working in a spreadsheet.

              You could keep your datasets wide after merging. That way you'd have individual variables for each of the numbers of tools etc that each household has. You'd have to figure out what you'd want to do with each tool (or other item's) properties such as 'is it a tiltool', 'value' etc. Do you want to keep individual variables of all properties for each item, or make compound variables, such as total number of tiltool, total value etc, for each household?. This datashape would make most sense with most analyses done in Stata, although, unlike a spreadheet format, it's rather difficult to read

              You could also append, and before you do so, rename variables in each dataset. Example: ftool and dasset could both be renamed 'item', wheeras numgood and numtool could be 'number'.
              This approach would make you end up with something more like what I think you are trying to achieve. This would also be closer to the layout of the individual datasets, but that format is not particularly suited to much analyses I could think of in Stata.

              To help you think about this: suppose you'd want to find a relation between how much land a farmer has versus other property. With the wide layout, you could regress land ownership versus number of telephones, number of TV's etc, or total value of durable goods. In the long layout, you could only regress e.g., number of televions versus value of televisions, or more to the point, you'd be limited to regressing properties for individual item categories against each other. Probably not what you want.

              Comment


              • #8
                Hi Jorrit,

                Thanks for sharing the detailed comments which triggered a deeper reflection of what I am seeking to do, and eventually got me late in getting back with a response. Apologies for that!

                I am setting up the data to do a household level analysis but for each crop. It would be inappropriate and complicated to have all plots or crops put together at the household level because a good number of the households have multiple plots and produce 2 or more of 4 crops that I am investigating - Maize, Millet, Rice and Sorghum. Some of these households also produce the same crop in both major and minor seasons. So I tagged the observations at household-plot-crop level by identifying the households that cultivated any of the crops on at least one plot in at least one season.

                So, after considering the issues you flagged and looking closely at the dta, it will be easier for me to keep the data in wide form with its less merging complexities. As to whether that will suit the analysis I want to do, I am not so sure about that now especially after your comments. For this reason, and pending additional suggestions, I will like to merge the data and keep separate files in both the wide and long formats so I can compare results of the analysis to see which is appropriate. Luckily, the code that Chrisk provided has proved successful in terms of the outlook of the merged data and what I wanted and originally described, even though I don't completely understand how stata worked with it in the merging. My fear is that tagging the cases the way I did may be complicating the merging without a trace if I am not careful.

                If I can get the data shape type right, it will be helpful and I can now carry out the appropriate merge. Thanks again for the helpful ideas that you continue share on this.

                Francis

                Comment


                • #9
                  Hi Chrisk,

                  I revisited the code you suggested to examine the output in detail to understand better how it works. Given that the new variable I create (ident ) is a product of the two identifiers in each of the 5 data files which are different in terms of the pairing combinations, won't that result in some distortions in the eventual merged data? This was the initial impression I got running the code the first time when I did not get zero from the duplicates list of ident.

                  This time round, I noted that hhno and ident identify the observations, but ftool and ident do not. Does this mean that the command standardizes the two original identifiers in each of the paired case with respect to hhno? The two original unique identifiers (e.g hhno ftool) also still identify the data in each dataset. Since hhno and ident identify the data, I repeated the process as you suggested using the same name ident in all the 5 files to have it in all of them. I then used it alongside hhno to merge all the files.

                  Was this what you meant I do? That was what I meant in the initial posting when I referred to proceeding with the merging if I can be assured that concatenating the 2-paired identifiers of each file into 1 unique identifier will be centered on hhno.

                  The merging went well seemingly and looks straight forward but I cannot figure out how stata did it with the involvement of ident whose values look identical across many observations. Any additional notes on the code, especially how the option punct (" ") works beyond will be very helpful and educating.

                  Thank you!
                  Francis

                  Comment


                  • #10
                    Are you sure this is what you need?
                    The code Chris provided would have worked out if both identifiers are the same across all datasets, but not here. Chris' also suggested to use the newly created identifier variable to merge, not the new id var plus one of the old ones. Stata will merge it in such a way, but the result wont be terribly useful.
                    With the data examples you provided, for example, you will end up with a number of lines (rows) per household. The new ident value 101001002 4 will give one row with the number of radios and the number of hoes for hh 101001002, whereas ident 101001002 5 will be a row with info on televisions and axes. The same logic will apply to different farmland types.
                    With such a datashape, you would be able to correlate or regress farmland type 4 on the number of radios and hoes, but not on the number of any other tools or durable goods, as those values will be missing in those rows.
                    For almost any analyses I can think of in Stata, you would want your data wide here. One line per household only. If you do not want to work with compound variables such as all farmland, than that's fine, keep the info on all farmland, tool, goods type etc. Still, it'd have to be one line per household. The only exception is if you have repeated observations for individual households over time (if the survey is repeated multiple years).

                    The code Chris concatenates the values of the two variables (more in this link). The punct(" ") option tells Stata to include a single space between values of the original variables.

                    Maybe it helps here to show what the merge you have done results in, using only the example datasets you provided. Using such smaller sets is usually easier to see what you are doing than working on the complete datasets right from the start. Is this what you wanted?

                    Code:
                    . list hhno dasset numgood gowner1 newvar ftool numtool tiltool spraymach
                    
                         +------------------------------------------------------------------------------------------------------------------+
                         |      hhno          dasset   numgood   gowner1        newvar                 ftool   numtool   tiltool   spraym~h |
                         |------------------------------------------------------------------------------------------------------------------|
                      1. | 101001002    Mobile phone         1         1    1.01e+08 3                     .         .         .          . |
                      2. | 101001002           Radio         1         1    1.01e+08 4                   Hoe         1        No         No |
                      3. | 101001003   Non-farm land         1         1    1.01e+08 1                     .         .         .          . |
                      4. | 101001003           Radio         2         1    1.01e+08 4                   Hoe         3        No         No |
                      5. | 101001003      Television         1         1    1.01e+08 5                   Axe         7        No         No |
                         |------------------------------------------------------------------------------------------------------------------|
                      6. | 101001004   Non-farm land         3         1    1.01e+08 1                     .         .         .          . |
                      7. | 101001004    Mobile phone         1         1    1.01e+08 3                     .         .         .          . |
                      8. | 101001004           Radio         1         1    1.01e+08 4                   Hoe         2        No         No |
                      9. | 101001009    Mobile phone         1         1    1.01e+08 3                     .         .         .          . |
                     10. | 101001009           Radio         1         .    1.01e+08 4                     .         .         .          . |
                         |------------------------------------------------------------------------------------------------------------------|
                     11. | 101001002               .         .         .   1.01e+08 14               Cutlass         1        No         No |
                     12. | 101001003               .         .         .   1.01e+08 14               Cutlass         3        No         No |
                     13. | 101001003               .         .         .    1.01e+08 6                  Rake         2        No         No |
                     14. | 101001003               .         .         .    1.01e+08 8              Pick axe         2        No         No |
                     15. | 101001004               .         .         .   1.01e+08 14               Cutlass         1        No         No |
                         |------------------------------------------------------------------------------------------------------------------|
                     16. | 101001004               .         .         .    1.01e+08 9   Sickle/reaping hook         3        No         No |
                         +------------------------------------------------------------------------------------------------------------------+

                    Comment


                    • #11
                      Hi Jorrit,

                      Yes, originally I thought this was the shape stata files for analysis should be, perhaps influenced into thinking that way because I accessed the data in the long format. I now understand the logic as the data came in long shape presumably because the study for which the data was collected is aimed at a panel but what I am using is the baseline/wave 1. With this understanding now from your detailed explanations and reassessment of the data, I think I am fine now and will reshape to wide and work with that.

                      Thank you for the exposition of the code, the many reference links you introduced me to, the patience and the great amount of time you spent in assisting me find a way out!

                      Francis

                      Comment


                      • #12
                        Hi, i have data on income earned from various agricultural output and i would like to analyze the variations of income earned from those products.Which is the most appropriate model that will give me robust results?
                        Thank you.
                        Flozzy

                        Comment


                        • #13
                          Hi Flozzy,
                          please start a new thread for new questions. Also provide more details on your data, e.e.g, if it has repeated observations over time and/or different geographical regions.

                          Comment


                          • #14
                            Thanks Jorrit for the quick response, though your first comment on starting new thread isn't clear. N'way as i figure out that may be some more clarification on the data.It is a cross sectional data collected from two different regions ( Counties) between May to June 2016, though the information is for the last production year that is 2015. The main crops of comparisons include, cassava , maize , beans, sorghum and sweet potatoes.I'll appreciate your indulgence.
                            Flozzy

                            Comment


                            • #15
                              Hi Flozzy,

                              What Jorrit meant is that your post is a reply to someone else question posted for discussion and being discussed, which is Francis Dompae in this case. Rather than post your problem this way, post it as a new question independent of a discussion thread of an existing post. Please, read the FAQ and you will be able to follow and re-post your problem for help.

                              Thank you!

                              Comment

                              Working...
                              X