Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge datasets; r(459); variable does not uniquely identify observations in the using data

    Hello everyone,

    I just startet to use stata and have a problem with merging 3 different datasets (with Stata 12).

    I used the merge (m:1) command and got an error message r(459) "variable does not uniquely identify observations in the using data".

    merge m:1 company_ID using Set 2.dta

    merge m:1 company_ID using Set 3.dta

    The Datasets look like this:

    Set 1 (master):

    company_ID job_request_date graduate_id
    123 12.11.2014 57878
    123 12.10.2014 78878
    123 16.11.2014 99899
    121 14.11.2014 55744
    345 12.10.2014 55879
    876 12.09.2014 55879
    876 19.09.2014 14787
    1000 19.09.2014 14787 (--> not available in Set 2)
    . (missing) . (missing) 68994 (--> no job offer --> no company_ID)
    ... .....

    Set 1 contains multiple obersevations for one company.


    Set 2:

    company_ID number_employees
    123 100
    121 50
    345 600
    876 800
    ... .....

    Set 2 contains one observation for each company.


    Set 3:

    company_ID export_number
    123 1
    121 1
    345 5
    876 6
    1000 1
    ... .....

    Set 3 contains one observation for each company. Not every Company_ID of Set 3 is included in Set 2.


    I want to add the information of Set 2 and Set 3 for each observation in Set 1:

    Set merged:

    company_ID job_request_date graduate_id number_employees export_number
    123 12.11.2014 57878 100 1
    123 12.10.2014 78878 100 1
    123 16.11.2014 99899 100 1
    121 14.11.2014 55744 50 1
    345 12.10.2014 55879 600 5
    876 12.09.2014 55879 800 6
    876 19.09.2014 14787 800 6
    1000 19.09.2014 14787 . (missing) 1
    ... .....

    It is possible that the company_ID and the graduate_id are the same but since I define the company_ID as the keyvariable there should not be a problem?

    I think that it might be a problem that the different sets contain company_IDs that are not in all ofthe other datasets?! I only know how to merge Data wih exel "vlookup". Add information if you find a matching pair f.e. company_ID = 123 in both files. Does it work the same with the merge command?


    I hope you can help me with this problem.
    Thank you!
    Last edited by Jessica Berger; 22 Nov 2015, 01:28.

  • #2
    I suggest using duplicates to check that should be true is indeed so.

    Comment


    • #3
      Jessica, if the example above represents your real data, you can try this code.

      Code:
      clear*
      input company_ID str10 job_request_date graduate_id
      123 "12.11.2014" 57878
      123 "12.10.2014" 78878
      123 "16.11.2014" 99899
      121 "14.11.2014" 55744
      345 "12.10.2014" 55879
      876 "12.09.2014" 55879
      876 "19.09.2014" 14787
      1000 "19.09.2014" 14787
      end
      //SAVE THE MASTER FILE
      tempfile main
      save `main'
      
      clear
      input company_ID number_employees
      123 100
      121 50
      345 600
      876 800
      end
      //SAVE THE FIRST USING FILE
      
      tempfile set2
      save `set2'
      
      clear
      input company_ID export_number
      123 1
      121 1
      345 5
      876 6
      1000 1
      end
      //SAVE THE SECOND USING FILE
      
      tempfile set3
      save `set3'
      //MERGE THE FIRST AND THE SECOND USING WITH THE MASTER FILE
      use `main', clear
      merge m:1 company_ID using `set2', keep(1 3) nogen
      merge m:1 company_ID using `set3', keep(1 3) nogen
      list ​
      I've just read Nick's comment and indeed you should check for duplicates in the using files (set2 set3).
      Last edited by Oded Mcdossi; 22 Nov 2015, 03:11.

      Comment


      • #4
        You can also use the
        Code:
         isid
        command to check your assumptions about the primary key in a file.

        Comment


        • #5
          Thank you all for your advice! I checked for duplicates in my datasets, but there are none. I managed to merge set 1 and set 2 with the command merge m:m company_id using x.dta

          I tried to do it the same way with set 3 but I get an error: r(909)

          op. sys. refuses to provide memory
          Stata's data-storage memory manager has already allocated 1760m bytes and it just attempted to allocate another
          16m bytes. The operating system said no. Perhaps you are running another memory-consuming task and the command
          will work later when the task completes. Perhaps you are on a multiuser system that is especially busy and the
          command will work later when activity quiets down. Perhaps a system administrator has put a limit on what you
          can allocate; see help memory. Or perhaps that's all the memory your computer can allocate to Stata.

          The merged dataset (set1_set2) has the size of 1,3 GB and set 3 has the size of 40 MB. I already tried to shut down other running programms; to restart my computer (which I bought just a few month ago). I also did run the compress command. Any thoughts how to solve this problem?!

          Thank you!



          Comment


          • #6
            I would like to reiterate my plea to the powers to be at StataCorp that the many-to-many merge syntax be eliminated.

            Jessica stated in #1

            Set 2 contains one observation for each company.
            Set 3 contains one observation for each company.
            But did not quite connect the dots when Stata complained that the following command

            Code:
            merge m:1 company_ID using Set 2.dta
            yielded an error message that would look like

            Code:
            variable company_ID does not uniquely identify observations in the using data
            r(459);
            She choose to ignore the good advice she received in this thread and decided to resolve the issue using a many-to-many merge.

            Here's an example where company_ID 123 has two observations in Set 2.

            Code:
            clear
            input company_ID number_employees
            123 100
            123 999
            121 50
            345 600
            876 800
            end
            save "Set 2.dta", replace
            
            clear
            input company_ID str10 job_request_date graduate_id
            123 "12.11.2014" 57878
            123 "12.10.2014" 78878
            123 "16.11.2014" 99899
            121 "14.11.2014" 55744
            345 "12.10.2014" 55879
            876 "12.09.2014" 55879
            876 "19.09.2014" 14787
            1000 "19.09.2014" 14787
            end
            
            merge m:m company_ID using "Set 2.dta"
            
            isid company_ID job_request_date, sort
            list if company_ID == 123
            Each time you run the code above, you are likely to get a different match. For example:

            Code:
                 +-----------------------------------------------------------+
                 | compan~D   job_requ~e   gradua~d   number~s        _merge |
                 |-----------------------------------------------------------|
              2. |      123   12.10.2014      78878        100   matched (3) |
              3. |      123   12.11.2014      57878        999   matched (3) |
              4. |      123   16.11.2014      99899        999   matched (3) |
                 +-----------------------------------------------------------+
            and
            Code:
                 +-----------------------------------------------------------+
                 | compan~D   job_requ~e   gradua~d   number~s        _merge |
                 |-----------------------------------------------------------|
              2. |      123   12.10.2014      78878        100   matched (3) |
              3. |      123   12.11.2014      57878        999   matched (3) |
              4. |      123   16.11.2014      99899        100   matched (3) |
                 +-----------------------------------------------------------+
            The many-to-many merge issue resurfaces regularly on Statalist. This shows that as long as the option remains available, some users will convince themselves that it is a good option for them, notwithstanding all the words of caution in the official documentation. I fear there are far more users who go many-to-many that we ever hear about on this list.

            Comment


            • #7
              +1 to Robert Picard's comments in #6.

              Jessica should go back to her original data sets. While she believes that there is just one observation per company_ID in data sets 2 and 3, the -merge- command is telling her otherwise.

              Code:
              use dataset2, clear
              duplicates list company_ID
              
              use dataset3, clear
              duplicates list company_ID

              Comment


              • #8
                Thank you for your replies...I did not realize that the m:m command will lead to a false match.

                I finally managed to merge the datasets!
                The problem was, that the datasets had missing values, that were counted as duplicates, which I didn't realize when I first looked for duplicates. After dropping the missing values in set 2 and set 3 the merge command m:1 worked perfectly.

                Thank you for your help!

                Comment

                Working...
                X