Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with Merge Error: Merge Datasets by uniqueid (datasets different size)

    Hello All,

    I am trying to merge two data sets by id. The first is a long data set, containing multiple instances per id, this has 27,384 observations. The second data set contains only one instance of each id with 1,697 observations. Each data set contains different data, the long dataset contains school enrollment information, the second contains student characteristics like class rank, gpa, sex, etc. I tried
    Code:
    merge 1:m randomid using YUC_Stu_Characteristics.dta
    and receive error: variable randomid does not uniquely identify observations in the master data

    My unique identifier is randomid. I have already verified that each of the 1,697 ids in the shorter data set appear in the long data set.

    Data set 1 (27,384 observations)
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double randomid long high_school_grad_date str7 publicprivate str17 year4year str2 college_state
    2103300231 20220608 "Private" "4-year" "CA"
    2103300237 20220608 ""        ""       ""  
    2103300243 20220608 ""        ""       ""  
    2103300255 20200603 "Private" "4-year" "CA"
    2103300255 20200603 "Private" "4-year" "CA"
    2103300255 20200603 "Private" "4-year" "CA"
    2103300255 20200603 "Public"  "2-year" "CA"
    2103300255 20200603 "Private" "4-year" "CA"
    2103300255 20200603 "Public"  "2-year" "CA"
    2103300255 20200603 "Private" "4-year" "CA"
    end
    Dataset 2 (1,697 observations)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double randomid int classrank str1 sex
    3342021132  87 "M"
    4383032232  81 "M"
    4719000696  20 "M"
    4716004272  39 "M"
    end
    I do not want to drop any of the repeated ids since each line contains different data that is needed for my analysis. I am wondering how to merge these two datasets or if perhaps it needs to be reshaped first? Help is greatly appreciated

  • #2
    The second data set contains only one instance of each id with 1,697 observations.
    But Stata says that's not true, and I have never known Stata to be wrong about this.

    It is not clear from your code which data set is which here. But if the data set that you believe has only one observation per id is the one namedYUC_Stu_Characteristics.dta, then the problem is that you should have coded -merge m:1 ...- not 1:m.

    However, if that's not the case, then your data are not as you think they are. You need to find the instances where the same id appears more than once in that data set:
    Code:
    duplicates tag randomid, gen(flag)
    browse if flag
    will show them to you.

    Once you find them, you then have to figure out what to do about them. Given the context and description of your data sets, I think that those extra observations should not be there. Consequently they would be data errors and you have to fix them. This means you have to decide which among the duplicates are the correct ones (if any) and drop the rest. Or perhaps you have multiple observations for the same id, each with incomplete data, and you need to combine them in some way. There are many possibilities, and you will just have to see what you have and figure out what is wrong.

    But I encourage you not to just fix the duplicate observations you find. Their presence implies that something went wrong in the creation of this data set. So you need to revisit the data management process that led up to it to find out how the additional observations crept in. In doing that, you may find other errors as well, and you should fix them to, even if they have not caused you any trouble yet.
    Last edited by Clyde Schechter; 27 Mar 2023, 17:56.

    Comment


    • #3
      Clyde Schechter Thanks for your response. Perhaps some clarification is needed on my behalf. The larger dataset contains more students than I have student characteristics data. I only have characteristics for 1,697 students, which are all part of the larger dataset. The same id does appear more than once in the large dataset but not in the smaller one.

      I am hesitant to drop the duplicates from the large dataset because each instance contains different data that is needed for my analysis.

      Comment


      • #4
        I am not recommending you drop duplicates from the large data set. Only from the small one, if there are any there (there shouldn't be, but your data set may not be what you think it is). I can't tell if that is the problem, or if you just used -1:m- where you should have used -m:1.
        "
        First try changing 1:m to m:1. I think there is a good chance this is all you need to do here. (Do not, under any circumstances, try m:m--that will give you no error messages but it will make a data salad out of your two data sets and everything you run from that point on will be wrong.) But if that doesn't solve the problem, then you have to deal with the duplicates that shouldn't exist in the second data set but are actually there.

        Comment


        • #5
          Clyde Schechter you were right. I found 48 instances of duplicates in my small dataset. Could you help with a code to drop only the duplicates? I do not want to accidentally drop the wrong data

          Comment


          • #6
            Code:
            h duplicates
            you will note that there is an option "force" - do not use it - if there are some, apparently immaterial or irrelevant differences, post back with a -dataex- example and what you think should happen

            Comment

            Working...
            X