Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merging many data sets

    Hi there,
    I am trying to merge 150 data sets and I realized my way of merging data is not efficient. I merge as follow:

    use data1.dta, clear
    sort id
    save data1.dta, replace

    use data2.dta, clear
    sort id
    save data2.dta, replace

    merge 1:1 _n using data1.dta
    drop _merge
    sort id
    save merge1.dta, replace

    use data3.dta, clear
    sort id
    save data3.dta, replace

    merge 1:1 _n uisng merg1.dta
    drop _merge.dta
    sort id
    save merge2.dta, replace
    .
    .
    .
    .so on ....I realized its not efficient way to merge data when you have large data set.

    Then I tried to use to loop function, and I am fairly new in terms of using loop function and I tried to merge with first 10 data set and I some how failed to merge. Following is my syntax:

    local data data1-data10
    foreach x of local data{
    use `x', clear
    sort id
    save `x'.dta, replace

    merge 1:1 _n using `x'
    drop _merge
    save merge_all.dta

    end

    Above function does not merge at all. Your help will highly appreciated. Many thanks.

  • #2
    Originally posted by Ngawang Dendup View Post
    Hi there,
    I am trying to merge 150 data sets and I realized my way of merging data is not efficient. I merge as follow:

    use data1.dta, clear
    sort id
    save data1.dta, replace

    use data2.dta, clear
    sort id
    save data2.dta, replace

    merge 1:1 _n using data1.dta
    drop _merge
    sort id
    save merge1.dta, replace

    use data3.dta, clear
    sort id
    save data3.dta, replace

    merge 1:1 _n uisng merg1.dta
    drop _merge.dta
    sort id
    save merge2.dta, replace
    .
    .
    .
    .so on ....I realized its not efficient way to merge data when you have large data set.

    Then I tried to use to loop function, and I am fairly new in terms of using loop function and I tried to merge with first 10 data set and I some how failed to merge. Following is my syntax:

    local data data1-data10
    foreach x of local data{
    use `x', clear
    sort id
    save `x'.dta, replace

    merge 1:1 _n using `x'
    drop _merge
    save merge_all.dta

    end

    Above function does not merge at all. Your help will highly appreciated. Many thanks.
    I think it's best to first open all the data and sort it, then just merge all of them.

    E.g.
    Code:
    local data data1-data10
    foreach x of local data{
    use `x', clear
    sort id
    save `x'.dta, replace
    }
    
    local data data2-data10
    use data1, clear
    foreach x of local data{
    merge 1:1 id using `x', nogen
    }
    save merge_all.dta

    Comment


    • #3
      I don't think this code will work yet. If you put the text data1-data10 into a local, foreach won't expand it into a list. It's not a varlist.


      Code:
      forval k = 1/10 {
           use data`k', clear
           sort id
           save data`k'.dta, replace
      }  
      
      use data1, clear
      
      forval k = 2/10 {    
           merge 1:1 id using data`k', nogen
      }
      
      save merge_all.dta

      Comment


      • #4
        Thank you so much Jesse. I tried with your code but it is merging only data1.dta and data10.dta. Your help is really appreciated and would be grateful if you could also look into it once again. Thank you.

        Comment


        • #5
          Thank you Nick Cox, and really really appreciated for the help.

          Comment


          • #6
            There is something strange about this whole thread. After sorting out the syntax errors, Ngawang is sorting each data set on a variable id, and then serially -merge-ing them 1:1 _n.

            Evidently, if one is going to -merge 1:1 _n-, then the data sets need to be in their proper sort order before you -merge- them. But if -sort id- can accomplish that, it is simpler to just -merge 1:1 id-, which does not require that any of the data sets be pre-sorted. This will work just fine if id uniquely identifies observations in each data set.

            If id does not uniquely identify observations in each data set, then you won't be able to -merge 1:1 id-. But, if that is the case, your current procedure, which sorts on id, will produce a random shuffle of the observations within groups having the same id, a different random shuffle in each data set. Those will then be -merge-d together to produce a data set in which observations from different data sets are randomly matched with each other. Perhaps that is what you want, but it would be unusual.

            So I think you need to be clear about what you are doing. If id uniquely identifies observations, all you need is:

            Code:
            use data1, clear
            
            forvalues i = 2/10 {
                merge 1:1 id using data`i', nogenerate
            }
            If id does not uniquely identify observations, with -merge 1:1 _n- you are in the process of creating a randomly shuffled melange of data sets. Caution!

            Comment


            • #7
              Originally posted by Clyde Schechter View Post
              ...
              I recall seeing the message, "using data not sorted" quite frequently after merging. But I think that may have been using a very old Stata version (where I now discovered you can simply specify "sort" in the merge syntax to fix this...)?
              Last edited by Jesse Wursten; 17 Aug 2016, 08:49.

              Comment


              • #8
                Clyde: Good point. People who have used Stata for a while will recall that merge did require a prior sort. I have not checked when that requirement was dropped -- most likely when the syntax 1:1 etc was introduced. However, I can't see that the sorting will do any harm, which is the converse of Clyde's point that it is no longer necessary, so I can't see that there is a risk of melange. If the 1:1 merge is not possible, then the command will fail.

                Comment


                • #9
                  If id does not uniquely identify observations, then -merge 1:1 id- will fail, but -merge 1:1 _n- will not. And with the sort order within the data sets having been randomized, -merge 1:1 _n- will produce the melange I referred to.

                  Comment


                  • #10
                    OK. That was the original code in #1. I started reading the thread at #2. It wasn't included in anything I was recommending.

                    Comment


                    • #11
                      Thank you Nick, Clyde and Jesse for very useful discussion. I have solved this problem now and thanks to all of you. Highly appreciated for the response, discussion and help.

                      Comment


                      • #12
                        Originally posted by Nick Cox View Post
                        OK. That was the original code in #1. I started reading the thread at #2. It wasn't included in anything I was recommending.
                        This is my bad - I swapped the _n for id for the exact reason Clyde mentioned, but forgot to mention this in the end.

                        Comment


                        • #13
                          Jesse: Not your fault to improve code! I didn't read #1, so that was my fault.

                          Comment

                          Working...
                          X