Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Running the same .do-file multiple times results each time in a different amount of observations after merging 3 datasets

    Dear Statalists,

    Running the same .do-file multiple times, without making any adjustments by hand between those runs, results each time in a different amount of observations.
    After merging three datasets, the outcome of total obs. keeps changing.

    Below is an example of what I do after the 1st merge with the masterfile called Example.dta
    Maybe this is the reason of the strange problem.

    Save Example1.dta, replace
    use Likewise.dta, clear
    drop useless variables
    rename some variables
    sort key variables
    save Likewise1.dta, replace
    use Example1.dta, clear
    sort key variables

    merge based on the key variables (of both datasets) using Likewise1.dta

    4 'endresults': 29,551 obs; 29532; 29563; 29588
    Last edited by LydiaSmit; 09 Jul 2014, 23:07.

  • #2
    This is likely due to the -sort- command. Adding the -stable- option is likely to fix it.
    __________________________________________________ __
    Assistant Professor, Department of Biostatistics and Epidemiology
    School of Public Health and Health Sciences
    University of Massachusetts- Amherst

    Comment


    • #3
      I alos suspect -sort- is the problem. But I don't think the -stable- option will solve it. Instead I think the -stable- option will just hide the problem with the identifying variables. So I would strongly recommend not to use the -stable- option. Instead she should take a close look at the identifying variables and find the real problem. She can find useful material on how to diagnose possible problems here:
      http://blog.stata.com/2011/04/18/mer...rges-gone-bad/
      http://blog.stata.com/2011/05/27/mer...le-key-merges/

      Appart from that Lydia is using old syntax for -merge-. She can avoid -sort-ing altogether by using the new syntax.
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------

      Comment


      • #4
        I'll certainly defer to Maarten's expertise on this!
        __________________________________________________ __
        Assistant Professor, Department of Biostatistics and Epidemiology
        School of Public Health and Health Sciences
        University of Massachusetts- Amherst

        Comment


        • #5
          I also recommended -sort, stable- in this thread. It turned out to be bad advice, because the user did have a problem that stable was hiding rather than solving. I won't give that advice again.

          http://www.statalist.org/forums/foru...-ttest-command
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          Stata Version: 17.0 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam

          Comment


          • #6
            Thank you for trying to help me.

            Maarten, I already checked those exact two sites before I decided to merge on multiple key variables for my first merge. However, I didn't understand an important part.
            by id: assert _N==1 The writer, William Gould, used that command probably to check the uniqueness as he described before, however, when I used that command a few days ago. Nothing happened.

            use one, clear . merge 1:1 id sex using two . sort id . by id: assert _N==1 . keep if _merge==3
            By the way, I also checked the following two sites for my merges (about a week ago), which Konrad Zdeb advised me today. "Merging in Stata is neatly explained here and here."
            http://www.statalist.org/forums/foru...t-consequences

            I used http://www.ats.ucla.edu/stat/stata/faq/multmerge.htm to do my merges but I combined that method with adding an extra key variable for 2 of my 3 merges. (So using 2 key variables for those two merges as explained in the links you, Maarten, gave me.

            These three merges, of which the first one is highly likely to be succesful, are the first merges I ever made in Stata. (Never learnt anything about merges at college, work etcetera.)

            Comment


            • #7
              I haven't been following this carefully, but you may want to check out the -isid- command. isid checks whether the specified variables uniquely identify the observations. So, you would do something like

              isid var1 var2

              If there is a problem, you will get the message

              Code:
              . isid var1 var2
              variables var1 var2 do not uniquely identify the observations
              r(459);
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              Stata Version: 17.0 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam

              Comment


              • #8
                On one detail: assert says nothing if the statement is true. Try some tautology:

                Code:
                assert 42 == 42
                You will get no output. The flavour of assert is: something should be true, so check if it is. If not, stop me with the news; otherwise let me carry on.

                Or more succinctly, no news is good news.

                For more, with examples, see William Gould at http://www.stata-journal.com/sjpdf.h...iclenum=dm0003

                Comment


                • #9
                  If the number of merged observations changes at each run of the do-file, then the datasets used for the merge have changed so you must look upstream for what has happened. Since you are saving each dataset before performing the merge, you can use the cf command to quickly check for changes between do-file runs. For example, before saving a new version, you would include in your do-file

                  Code:
                  cf _all using Example1.dta, all
                  save Example1.dta, replace
                  If you do this for all new versions of your datasets, you should quickly be able to identify which dataset has changed.

                  As to why the dataset has changed, as others have pointed out, this is usually because of an incomplete sort. Look back at all the sort commands in the do-file and replace them with isid varlist, sort commands. For example, replace

                  Code:
                  sort country id
                  with

                  Code:
                  isid country id, sort

                  Comment


                  • #10
                    Thank you for the replies. isid and especially assert seem to be very useful.

                    Could a m:m merge lead to this problem of mine? (different endresult of obs while running the same .do-file over and over).

                    Robert, your reply seems to be the solution for my problem. Thank you very much. I'll check it after I learnt more about the joinby-command which you advised me in my other topic.

                    Comment


                    • #11
                      Sure, another problem with m:m merge is that within groups of the key variable used to perform the merge, the order of observations is not taken into account. After any merge, you must perform a sort to get the observations back in order. If you follow a regular merge with a m:m merge, the second merge will sort your data only based on the key variable. That means that at each run, the order of observations within each group will change. Try running the following example many times, the results should change at each run. Yet another reason why you should never use m:m merge!

                      Code:
                      clear
                      input str1 id x
                      A 1
                      A 2
                      A 3
                      B 2
                      end
                      tempfile fx
                      qui save "`fx'"
                      list, sepby(id) noobs
                      
                      clear
                      input str1 id y
                      A 11
                      A 12
                      A 13
                      A 14
                      B 12
                      B 13
                      B 14
                      end
                      tempfile fy
                      qui save "`fy'"
                      list, sepby(id) noobs
                      
                      use "`fx'", clear
                      bysort id (x): keep if _n == 1
                      list, sepby(id) noobs
                      merge m:m id using "`fy'", keep(master match) nogen
                      list, sepby(id) noobs
                      
                      merge m:m id using "`fy'", keep(master match) nogen
                      list, sepby(id) noobs

                      Comment


                      • #12
                        Sorry but the last merge in my previous example did not do anything as I forgot to rename variables. This is what I was trying to illustrate:

                        Code:
                        clear
                        input str1 id x
                        A 1
                        B 2
                        end
                        tempfile fx
                        qui save "`fx'"
                        
                        clear
                        input str1 id y
                        A 11
                        A 12
                        A 13
                        A 14
                        B 12
                        B 13
                        B 14
                        end
                        tempfile fy
                        qui save "`fy'"
                        
                        use "`fx'", clear
                        merge 1:m id using "`fy'", keep(master match) nogen
                        list, sepby(id) noobs
                        
                        rename y y0
                        merge m:m id using "`fy'", keep(master match) nogen
                        list, sepby(id) noobs

                        Comment


                        • #13
                          Thank you for the great help. By the way, your command "isid country id, sort" gave me the following error: variables myvarname1 myvarname 2 should never be missing. r(459).

                          Based on what you said earlier. "Appart from that Lydia is using old syntax for -merge-. She can avoid -sort-ing altogether by using the new syntax." I only temporarily didn't specify the merge type for my 2nd merge because it seemed to result in a perfect match without chosing a merge type. About 9 of the 10 sites I checked with the new syntax method use the sort command like I did. For example, Indiana University https://kb.iu.edu/d/azck

                          Could someone tell me why they still use the the risky, unnecessary sort-command?

                          By the way, even without using the sort-command at all, I'm still getting a different total amount of obs after each run of the same .do-file
                          cf _all helped me finding the origin of this problem, however, I haven't solved it yet because the cause isn't clear yet.

                          Last edited by LydiaSmit; 10 Jul 2014, 22:46.

                          Comment


                          • #14
                            Below is an example of what I do after the 1st merge with the masterfile called Example.dta

                            Save Example1.dta, replace (so after I merged the masterfile Example.dta with an other dataset I save the new merged file under Example1)
                            use Likewise.dta, clear

                            Better example...... I start with the masterfile A.dta then I open B.dta, drop and rename soms vars and save & rename B.dta to B1.dta after that I merge A.dta with B1.dta ....after that, I save & rename the merged file to A1.dta....then I use C.dta, drop and rename some vars and then save & rename C.dta to C1.dta,,,,,,then I use joinby key variable (not unique) using C1.dta

                            Like Robert suggested, I used the following:

                            cf _all using A1.dta, all save Example1.dta, replace Most of the time only 8 vars contain mismatches according to the above command cf _all.....6 of those 8 always contain mismatches...1 of those 6 is 1 of the 2 key variables used in my 1st merge. However, sometimes 22 vars contain mismatches.

                            Hopefully someone can help me further. I haven't slept at all tonight trying to solve this. I didn't even lay down. I'm even thinking about giving a very small cash reward for the person who can solve this annoying problem

                            Comment


                            • #15
                              By the way, your command "isid country id, sort" gave me the following error: variables myvarname1 myvarname 2 should never be missing. r(459).
                              That means you have missing data on the variables you are trying to sort on. You can add the -missok- option to -isid- . But is there a way to clean up the missing data?
                              -------------------------------------------
                              Richard Williams, Notre Dame Dept of Sociology
                              Stata Version: 17.0 MP (2 processor)

                              EMAIL: [email protected]
                              WWW: https://www3.nd.edu/~rwilliam

                              Comment

                              Working...
                              X