Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to merge multiple sub-datasets of a household survey

    Dear all,

    I know that there are some threats asking about data merging issues and I have read some of them. However, I find my dataset are quite different relative to what I have read in these forums. In addition, I have tried to apply what I have learned in those threats into my datasets but unfortunately it did not work. Therefore, I decided to make a new one and followings are my data and coding information:

    In fact, I have three sub-datasets from a household survey, namely: file1 that contains demographic info; file2 contains educational info and the last one contains healthcare info. I realize that there are some common variables existed in the three files, including: province, district, commune, enumeration area, household identification and household member ID. Hence, my strategy is to base on these six variables to merge the three sub-datasets and here are my coding:

    I sorted data first
    Code:
    use "C:\Users\file2.dta", clear
    count 
    sort pro dist comm enum_area hh_ind hh_mem_id
    save "C:\Users\file2_sorted.dta", replace
    
    use "C:\Users\file3.dta", clear
    count 
    sort pro dist comm enum_area hh_ind hh_mem_id
    save "C:\Users\file3_sorted.dta", replace
    Merge data - I am using file1 as master and the others are using. I am using m:1 -merge
    Code:
    use "C:\Users\file1.dta", clear
    count 
    sort pro dist comm enum_area hh_ind hh_mem_id
    merge m:1 pro dist comm enum_area hh_ind hh_mem_id using "C:\Users\file2_sorted.dta"
    tab _merge
    keep if _merge==3
    drop _merge
    
    merge m:1 pro dist comm enum_area hh_ind hh_mem_id using "C:\Users\file3_sorted.dta"
    tab _merge
    keep if _merge==3
    The codes worked for merging file1 and file2 but failed when I tried to merge the last file, it said that "variables pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations in the using data". I do not know why so could anyone have any ideas/suggestions to solve the problem?

    In my case, I do not know how to create a sample data using dataex so I attach my datasets here (please kindly find the attached).

    I hope I can have replies from you.

    Thank you.
    Attached Files

  • #2
    Welcome to the Stata Forum / Statalist.

    The codes from dataex are available at Stata 15.1 and can be installed - findit dataex - if you have an older version.

    Since you mentioned that you "do not know how to create a sample data using dataex", here are a few tips:

    You may type - help dataex - in the command window.

    That said, let's say you wish the first 10 observations, you will just need to type:

    Code:
    dataex varlist in 1/10
    In case you wish to provide all variables, please exclude "varlist". In case you wish to select 3 variables, you just need to replace "varlist" by the respective names.
    Best regards,

    Marcos

    Comment


    • #3
      Originally posted by Marcos Almeida View Post
      Welcome to the Stata Forum / Statalist.

      The codes from dataex are available at Stata 15.1 and can be installed - findit dataex - if you have an older version.

      Since you mentioned that you "do not know how to create a sample data using dataex", here are a few tips:

      You may type - help dataex - in the command window.

      That said, let's say you wish the first 10 observations, you will just need to type:

      Code:
      dataex varlist in 1/10
      In case you wish to provide all variables, please exclude "varlist". In case you wish to select 3 variables, you just need to replace "varlist" by the respective names.
      Dear Marcos,

      Thank you so much for the dataex information and how to use it, I now can utilize the dataex function properly . By the way, do you have any suggestions dealing with my merging matters?

      Best regards,

      Dung Le

      Comment


      • #4
        If typing - help merge - in Stata's Command Window doesn't provide a solution to you case, I recommend to use - dataex - to provide a short example of the datasets.
        Best regards,

        Marcos

        Comment


        • #5
          I'd presume Stata is referring to file 3 when it says "pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations." To diagnose the problem, I'd suggest you open file3, and use the -duplicates- command to find out what observations do not have unique identifying information. Something like this should work:

          duplicates report pro dist comm enum_area hh_ind hh_mem_id

          One potentially tricky possibility that I have seen occurs if a file has several completely blank observations, usually at the end of the file. To Stata, such observations will have the same identifying information, i.e. the "." missing value code.
          Last edited by Mike Lacy; 14 Jun 2018, 06:58.

          Comment


          • #6
            Originally posted by Mike Lacy View Post
            I'd presume Stata is referring to file 3 when it says "pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations." To diagnose the problem, I'd suggest you open file3, and use the -duplicates- command to find out what observations do not have unique identifying information. Something like this should work:

            duplicates report pro dist comm enum_area hh_ind hh_mem_id

            One potentially tricky possibility that I have seen occurs if a file has several completely blank observations, usually at the end of the file. To Stata, such observations will have the same identifying information, i.e. the "." missing value code.
            Dear Mike,

            Thank you for your codes and suggestions. I have followed your instructions and seems it worked.

            Firstly, I used duplicates command to report duplicated observations and here are the results
            Code:
            duplicates report pro dist comm enum_area hh_ind hh_mem_id
            --------------------------------------
            copies | observations surplus
            ----------+---------------------------
            1 | 1778 0
            2 | 600 300
            3 | 144 96
            4 | 28 21
            5 | 5 4
            --------------------------------------

            And then, I looked at a duplicates list by using
            Code:
            duplicates list pro dist comm enum_area hh_ind hh_mem_id
            Finally, I dropped all duplicated observations by using
            Code:
            duplicates drop pro dist comm enum_area hh_ind hh_mem_id, force
            that resulted in 421 observations deleted. Following this, I did the same merging procedures that I mentioned in my first post and it worked. However, I have a minor question that should I use the command "keep if _merge==3" or I should not because I have seen someone advises not to do, therefore, I am quite confused. By the way, I just want to make sure that my merging procedures go to a right direction.

            Thank you.

            Best regards,

            Dung Le

            Comment


            • #7
              It's likely but not certain that "keep if _merge ==3" is what you want. Look at -help merge-, in particular the description of what the coding of the _merge variable means, and then decide on that basis if that's what you want.

              Regarding your decision to drop all duplicated observations: Whether or not that's a good idea would depend on why the observations happened to be duplicated, i.e., was it just an accident, or were there perhaps corrected versions of observations entered in the data set, etc.

              Comment

              Working...
              X