Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to merge data in stata

    hello eyeryone,

    i have a few data sets and need to merge them (it is panel data):

    data set 1:
    id date age income ....

    data set 2 :
    id date address dummy_1 dummy_2

    data set 3:
    id date address dummy_1 dummy_2



    data set 1 should contain all observation (id, date).
    In data set 2 you get the information about 50% of the observations (id, date) and in data set 3 the other 50 %, so I guess the main data set I need to take is the first one.
    So, how can I merge the data, so that I have just one data set and how can I control it?

    thanks a lot in advance!!
    lisa
    Last edited by lisa bäcker; 19 Aug 2015, 01:47.

  • #2
    i tried it like this:
    use C:\Main_Document.dta
    order id date
    drop Dummy_9
    drop country
    sort id date
    save C:\Main_Document2.dta, replace
    use C:\First_Document.dta
    order id date
    drop Dummy_8
    label var Dummy_7 “Income“
    save C:\First_Document2.dta, replace
    merge id date C:\Main_Document2.dta save C:\First_Document2.dta, replace
    save C:\merged_data, replace

    but it doesnt work. stata tells me:

    (note: you are using old merge syntax; see [D] merge for new syntax)
    master data not sorted

    would be great to get a response
    lisa

    Comment


    • #3
      The clue is in the error message
      'master data not sorted' -> your master data is 'first_document2'. Your syntax does not include a command for sorting this dataset.
      'first_document2' is your master data because this is the file that is open when you run the merge command.

      I would also recommend following Stata's advice and reading the manual section on merging data. This includes several helpful examples
      From what I gather from your description, 'merge 1:1' is the command for you as each case is present only once in each dataset
      Code:
      use "C:\Main_Document2.dta", clear
      merge 1:1  id date using "C:\First_document2.dta", keep(master)
      tab _merge
      The option 'keep(master)' tells Stata to keep all cases in your master data set (main_document2), also if there is no match with cases in the data sets you are merging with.

      Code:
      use "C:\First_Document2.dta", clear
      merge 1:1  id date using "C:\Main_document2.dta", keep(using)
      Should give you the same data set. Now 'main_document' is the using data and 'first_document2' the master data.

      Comment


      • #4
        great I will try it! thanks!!
        but is it possible to merge more data sets at the same time in my master data?
        because the observations for my variables I need to merge in the master data are seperated in 10 data sets.
        dont know how to handle this.
        Last edited by lisa bäcker; 19 Aug 2015, 03:11.

        Comment


        • #5
          I think this should do the trick
          Code:
           use "C:\Main_Document2.dta", clear  
           merge 1:1  id date using "C:\First_document2.dta" "C:\Second_document2.dta" , keep(master)
          But you will need to sort your datafiles on id and date first

          Please note that I'm assuming noone entered the panel after the first round.
          If this did happen and you want to include those cases in your merged data set you should drop the 'keep(master)' option
          Last edited by Evelyn Ersanilli; 19 Aug 2015, 03:09.

          Comment


          • #6

            thank you very much!

            if I use keep(master) i get this result:

            master only (1) | 4,326,566


            when i do not use it, I get this one:
            master only (1) | 4,326,566
            matched (3) | 458,450


            Do I understand it right: matched (3) means, that it worked perfectly and all data from my second data set is now in the data master set. I do not need to make another test to be sure, right?

            Thanks so much! You helped me so much!
            lisa

            Comment


            • #7
              It seems I gave you the wrong advise on the keep option.
              I looked it up:

              keep(master): observation appeared in master only
              keep(using): observation appeared in using only
              keep(match): observation appeared in both

              If you want to preserve the cases that drop out in later waves, you should not add the 'keep(master) option.
              If you only want to keep cases that were present in all waves of the panel you can add 'keep(match)', however if you do this you cannot determine if there was attrition bias.
              Last edited by Evelyn Ersanilli; 19 Aug 2015, 05:17.

              Comment


              • #8
                thanks for your help,

                but it still doesn't work.
                if I try to merge more than one data set I get this response: invalid '"C:\Second_Document2.dta'
                any idea how I can solve the problem?
                thank you!!

                Comment


                • #9
                  Start off by posting your full code + error message.

                  Comment


                  • #10
                    use C:\Main_Document.dta
                    order id date
                    drop Dummy_9
                    drop country
                    sort id date
                    save C:\Main_Document2.dta, replace

                    use C:\First_Document.dta
                    order id date
                    drop Dummy_8
                    label var Dummy_7 “Income“
                    save C:\First_Document2.dta, replace

                    use C:\Second_Document.dta
                    order id date
                    drop Dummy_8
                    label var Dummy_7 “Income“
                    save C:\Second_Document2.dta, replace

                    use C:\Third_Document.dta
                    order id date
                    drop Dummy_8
                    label var Dummy_7 “Income“
                    save C:\Third_Document2.dta, replace

                    merge 1:1 id date using "C:\First_document2.dta" "C:\Second_Document2.dta" "C:\Third_Document2.dta"
                    tab _merge


                    (note that First, Second and Third dataset contains the same Information, but for different individuals, Master document contains all individuals)

                    I get the error message:

                    merge 1:1 id date using "C:\First_document2.dta" "C:\Second_Document2.dta" "C:\Third_Document2.dta"
                    invalid '"C:\Second_Document2.dta'
                    r(198);

                    end of do-file

                    r(198);

                    Comment


                    • #11
                      Please post your code in code format (see Stata list FAQs).

                      If you leave out '1:1' do you get what you want?
                      Code:
                      merge id date using "C:\First_document2.dta" "C:\Second_Document2.dta" "C:\Third_Document2.dta"
                      Otherwise you can merge the files one at a time.
                      Code:
                      use "C:\Main_Document2.dta", clear
                      foreach file in First Second Third {
                      merge 1:1 id using `file'_Document2.dta
                      rename _merge merge`file'
                      }
                      Each merge will produce a _merge variable. Because you cannot have multiple variables by the same name you need to rename them in a structured way that allows you to connect them back to the files you are merging.

                      Comment


                      • #12
                        For those that are puzzled, the merge syntax was changed (for the better) with the introduction of Stata 11 and the functionality of specifying multiple using files was removed.

                        My reading of the original post and the following comment in #10

                        (note that First, Second and Third dataset contains the same Information, but for different individuals, Master document contains all individuals)
                        suggests that the best approach here is to use append to combine datasets with the same variables but for different individuals and then merge the combined dataset with the main dataset. Something like

                        Code:
                        * --------- create toy datasets that match description in post -----------------
                        clear
                        input id date age income
                        1 2001 20 55
                        1 2002 21 57
                        2 2001 30 40
                        2 2002 31 45
                        end
                        save "Main_Document.dta", replace
                        
                        clear
                        input id date str20 address dummy_1 dummy_2
                        1 2001 "here" 1 1
                        1 2002 "here" 1 0
                        end
                        save "First_Document.dta", replace
                        
                        clear
                        input id date str20 address dummy_1 dummy_2
                        2 2001 "there" 0 0
                        2 2002 "there" 1 1
                        end
                        save "Second_Document.dta", replace
                        
                        
                        * --------- append datasets from different individuals and then merge ----------
                        
                        clear
                        append using "First_Document.dta"
                        append using "Second_Document.dta"
                        
                        merge 1:1 id date using "Main_Document.dta"
                        
                        list
                        Note that you can achieve the same using merge with the update option but I prefer the append/merge solution because you are less likely to miss the error if it happens that there are individuals that appear in more than one dataset. It's also harder to track each individual merge results (_merge variable). This would look like

                        Code:
                        clear
                        use "Main_Document.dta"
                        merge 1:1 id date using "First_Document.dta", nogen
                        merge 1:1 id date using "Second_Document.dta", update
                        
                        list

                        Comment


                        • #13
                          thank you so much, Robert!
                          That's exactly what I was looking for !

                          Comment


                          • #14
                            my data set is huge, it's getting confusing..

                            I just got another error message again:

                            variables id date do not uniquely identify observations in the using data
                            r(459);
                            end of do-file
                            r(459);

                            I think there are more observations in the douments than in the master data set, but I just need the observations for them in the master data set.
                            is there a command I can use?

                            thank you very much in advance!

                            Comment


                            • #15
                              If you are following my append/merge example, then the using dataset is your "Main_Document.dta". You need to understand why you have multiple observations with the same values for the variables id and date in "Main_Document.dta". If you are satisfied that these make sense (for example, an individual can have more than one job in a year) and that you want to pair these with annual observations from the appended datasets, then all you need if to modify the merge to allow for multiple observations per key variables (in this case id date).

                              If you want the final dataset to ignore observations from the documents for values of id and date not found in the "Main_Document.dta", then you add the option to keep only the observations that match or that come from the using dataset ("Main_Document.dta") to the merge as well. Here's a revised example

                              Code:
                              * --------- create toy datasets that match description in post -----------------
                              clear
                              input id date age income jobid
                              1 2001 20 55 1
                              1 2001 20 55 2
                              1 2002 21 57 1
                              2 2001 30 40 1
                              2 2002 31 45 1
                              end
                              save "Main_Document.dta", replace
                              
                              clear
                              input id date str20 address dummy_1 dummy_2
                              1 2001 "here" 1 1
                              1 2002 "here" 1 0
                              end
                              save "First_Document.dta", replace
                              
                              clear
                              input id date str20 address dummy_1 dummy_2
                              2 2001 "there" 0 0
                              2 2002 "there" 1 1
                              3 2001 "whothat" 1 1
                              end
                              save "Second_Document.dta", replace
                              
                              
                              * --------- append datasets from different individuals and then merge ----------
                              
                              clear
                              append using "First_Document.dta"
                              append using "Second_Document.dta"
                              
                              merge 1:m id date using "Main_Document.dta", keep(match using)
                              
                              sort id date
                              list

                              Comment

                              Working...
                              X