Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems merging data

    Hi all,

    I have been trying to merge my data using key variables but error I keep getting is : " key variable do not uniquely identify observations in the using data"

    The is my first data
    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 id float(date days) byte measurement float date1 int(sugar ice smoke sugar2 ice1 smoke1 ice2 cie3) float(number date4)
    "44" 21213 130 12 21327 120 70   .  . . . . . 18 21369
    "44" 21213 130  8 21290 112 64   .  . . . . . 18 21369
    "30" 21213 130  2 21329 140 85   .  . . . . . 19 21369
    "30" 21213 130 10 21306 120 70   .  . . . . . 18 21369
    "30" 21213 130  4 21251 126 62   .  . . . . . 18 21369
    "30" 21213 130  6 21264 110 60   .  . . . . . 18 21369
    "30" 21213 144  5 21255 133 62   .  . . . . . 18 21369
    ""       . 144  3 21241 108 70   .  . . . . . 18 21369
    "30" 21213 130 11 21318 130 70 130 60 . . . . 18 21369
    "30" 21213 144  2 21235 124 76   .  . . . . . 18 21369
    "30" 21213 130  1 21227 118 62   .  . . . . . 18 21369
    "30" 21213 130  7 21269 120 58   .  . . . . . 18 21369
    "44" 21213 130  1 21284 105 55   .  . . . . . 19 21369
    "44" 21213 130  9 21297 124 60   .  . . . . . 18 21369
    "44" 21213 130  1 21192 112 64   .  . . . . . 17 21369
    end
    format %tdDD/NN/CCYY date
    format %td date1
    format %tdDD/NN/CCYY date4

    This is the second data:

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 id int(read1 read2) float(time self days value exp1 obse1)
    "419" 110 58             . 1 1 . . .
    "3"   120 46 1.8054964e+12 1 4 . . .
    "39"   81 44 1.8510804e+12 1 3 . . .
    "39"   84 50  1.724345e+12 1 3 . . .
    "39"   84 50 1.8505718e+12 1 3 . . .
    "39"  109 45             . 1 3 . . .
    "39"  143 50  1.850501e+12 1 3 . . .
    "39"  153 50 1.8138373e+12 1 4 . . .
    "373" 120 48 1.8326545e+12 1 5 . . .
    "373" 123 59 1.8765867e+12 1 5 . . .
    "373" 120 58   1.85383e+12 1 5 . . .
    "87"  120 50 1.8529647e+12 1 5 . . .
    "87"  100 50 1.8534015e+12 1 5 . . .
    "87"   89 50 1.7746176e+12 1 5 . . .
    "87"    9 50 1.8555128e+12 1 5 . . .
    "26"  121 48 1.7681596e+12 1 5 . . .
    end
    format %tcDD/NN/CCYY_HH:MM time
    format %tdDD/NN/CCYY value
    ------------------ copy up to and including the previous line ------------------

    Listed 16 out of 16 observations

    Any help would be greatly appreaciated

  • #2
    The error means what it says, the variables used for merging is not unique in the data set. The question did not show the actual merging command or did not describe the actual plan of the merging so it's hard to tell what went wrong. However, assuming you've attempted a 1:1 merge using id as the matching variable, then it would be caused by the fact that in both data there are multiple copies of the same id.

    Again, without knowing the actual intention, I cannot suggest anything. But there is also another command -joinby- that can do multiple merge (if that was the original intention.)

    Comment


    • #3
      Thanks for your reply. I think there are multiple copies of the same id because the data is currently in the long format. But what I intend to do is to add some variables that are missing in the first dataset from the second one. So both data sets are in the long format and have the same id but some variable e.g reading1 from the second data set I'd like to add to the first one

      Comment


      • #4
        Originally posted by Eli Omo View Post
        Thanks for your reply. I think there are multiple copies of the same id because the data is currently in the long format. But what I intend to do is to add some variables that are missing in the first dataset from the second one. So both data sets are in the long format and have the same id but some variable e.g reading1 from the second data set I'd like to add to the first one
        Then that would depend on of the variables you plan to add vary within an ID or constant within an ID. If it's constant, then you can check out -collapse- to compress the supplemental data set into one ID a line, and then use m:1 merge to combine them.

        If the variable you plan to add vary within ID, then it'd require you to identify additional variable (e.g. id number + year, etc.) so that it's unique enough for the collapsed data to be matched.

        Comment


        • #5
          The variable I want to add varies with ID. Would you be able to assist me with the code to collapse the supplemental data?

          Comment


          • #6
            Originally posted by Eli Omo View Post
            The variable I want to add varies with ID. Would you be able to assist me with the code to collapse the supplemental data?
            I'd like to help but I (and many others here) cannot render any help unless the question is clear. Please take a moment to read the FAQ (http://www.statalist.org/forums/help) on how to ask a good question. Basically, you'll need to:

            1) Clearly describe what you're trying to achieve, instead of what didn't work.
            2) Use -dataex- to show example. Which you did, but your examples are not useful because none of the ID exists in both data sets, so we will not be able even try our code to merge them. Try to use something like:
            Code:
            dataex var1 var2 var3 if inlist(id, "30", "44", "39", "87")
            so that both data sets would have the same people.
            3) Try to word the question with more information. For example, "I have this data 1 and this data 2. I'd like to add variable ABC from data 2 to data 1, matched by ID and XYZ."
            4) Show the code you used, if any.
            Last edited by Ken Chui; 13 Oct 2021, 14:37.

            Comment


            • #7
              Hi thank you very much for your help. The issue is that I am currently using fake data but I will try again to explain what I mean. So I have to twists of data with the common variable id (which idenitifies the person number) but there are some variables e.g "value " regarding " id" that are only in the second dataset but cannot be found in the first dataset. Currently as you can see above each "id" has multiple lines ( wide data) but I want to collapse each datasets so that each id only contains one line and then merge/join( unsure which one is suitable ) both datasets to have all information about the "id" in one dataset. I hope this is clearer. And once again thanks for all your help

              Comment


              • #8
                Originally posted by Eli Omo View Post
                Hi thank you very much for your help. The issue is that I am currently using fake data but I will try again to explain what I mean. So I have to twists of data with the common variable id (which idenitifies the person number) but there are some variables e.g "value " regarding " id" that are only in the second dataset but cannot be found in the first dataset. Currently as you can see above each "id" has multiple lines ( wide data) but I want to collapse each datasets so that each id only contains one line and then merge/join( unsure which one is suitable ) both datasets to have all information about the "id" in one dataset. I hope this is clearer. And once again thanks for all your help
                No, it's not clear. Because in the second data set all data under "value" is missing. It's fine to demonstrate with fake data, but do make sure they make sense.

                I'm going to use another example here to show how to do that if the information to be merged does not change within each id number:

                Code:
                clear
                input id x1
                1 1
                1 1
                2 1
                2 1
                3 1
                3 1
                end
                
                save data1, replace
                
                clear
                input id value
                1 17
                1 17
                2 19
                2 19
                3 21
                3 21
                end
                
                * If the information is constant within each id (e.g. id 1 is all 17, id 2 is all 19...)
                collapse (mean) value, by(id)
                merge 1:m id using data1
                
                list
                If the information to be merged also vary inside each id (aka, there are many values within id = 1), then this method will not work. And again, I have no idea what the truth is.

                Comment


                • #9
                  Thanks Ken, the information to be merged also vary inside each.

                  Comment


                  • #10
                    Originally posted by Eli Omo View Post
                    Thanks Ken, the information to be merged also vary inside each.
                    Then at least two possible next steps:

                    1) Is there a second, third, etc. variable that if you keep focusing in, will eventually see the "value" becoming constant? For example, here we added "year", while "value" is not the same within id, it's the same within id-year, so we can aggregate them:

                    Code:
                    clear
                    input id x1 year
                    1 1 2020
                    1 1 2021
                    1 1 2020
                    1 1 2021
                    2 1 2020
                    2 1 2021
                    2 1 2020
                    2 1 2021
                    end
                    
                    save data1, replace
                    
                    clear
                    input id value year
                    1 17 2020
                    1 17 2021
                    2 19 2020
                    2 19 2021
                    end
                    
                    * If the information is constant within each id (e.g. id 1 is all 17, id 2 is all 19...)
                    collapse (mean) value, by(id year)
                    merge 1:m id year using data1
                    
                    list
                    2) Another way is to consider extracting one number from this different "values" for each id (could be their mean, min, max, first, last, etc.) and then merge those back to the first data.

                    Either way, collapse and merge would probably be the major commands. Check out:
                    Code:
                    help merge
                    help collapse
                    to see how they work. Click on the PDF link to see the PDF version, in which you may find more examples.

                    Comment


                    • #11
                      Thanks I see what you mean I think I will have a detailed look at the merge and collapse commands

                      Comment

                      Working...
                      X