Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging (panel) datasets

    Hi,
    I am currently attempting to merge two datasets - panel data - both containing a unique ID (which occurs in both datasets) and 5 years of observations for each ID. For all possible merge options such ass 1:1, m:1 and 1:m I repeatedly get back that the "variables X Y do not uniquely identify observations in the master data". How can I successfully merge these two datasets? Thanks in advance!

  • #2
    without data examples (use -dataex- and post in CODE blocks - see the FAQ) it is not possible to give an exact answer; one guess is the you want -joinby- rather than -merge-; see
    Code:
    h joinby

    Comment


    • #3
      Welcome to Statalist.

      First, I'd suggest don't gamble with 1:1, m:1, and 1:m and see which one sticks. Instead, try to understand the data structure so that you have enough confidence to say that "yes, this data is supposed to contain one unique ID per row," or "yes, this data is supposed to have multiple of the same ID across rows." Without that knowledge, it's very hard to help you.

      With that said, you can examine where the duplicates are and flag those duplicates for examination. To check the extend of duplication:

      Code:
      duplicates report X Y
      To flag them for examination:

      Code:
      duplicates tag X Y, gen(my_flag_variable)
      * my_flag_variable is the name of the flag, you can rename whatever fits
      There are also other types of duplicates command as well (including one that allows you to delete duplicates). To learn more, use:

      Code:
      help duplicates
      Lastly, a more obscure reason is having multiple rows of missing values. This often happens when importing data from Excel which might come with some empty rows. Because missing "." is considered a value, having multiple of them in your matching variables will also cause the ID to be not unique.

      Comment


      • #4
        Thank you very much for the quick responses! I am fairly new to Statalist and thus still figuring out how it all exactly works. Regarding the two datasets in Stata that I am trying to merge:

        The first dataset contains average residential property values for a given municipality (based on their municipality code) for the years 2015 - 2020.
        The second dataset contains the number of jobs for a given municipality (based on the municipality code) for the years 2015 - 2020.

        Both datasets therefore have the following 'look':
        Municipality code | Year | Number
        X | 2015 | Number
        X | 2016 | Number
        X | 2017 | Number, etc.
        Y | 2015 | Number
        Y | 2016 | Number
        Y | 2017 | Number, etc.

        I am aiming to have the datasets merged in such a way that I will obtain the following:
        Municipality code | Year | Average residential property value | Jobs
        X | 2015 | Average residential property value | Jobs
        X | 2016 | Average residential property value | Jobs
        X | 2017 | Average residential property value | Jobs, etc.
        Y | 2015 | Average residential property value | Jobs
        Y | 2016 | Average residential property value | Jobs
        Y | 2017 | Average residential property value | Jobs, etc.

        I have managed to combine the two datasets with the following commands, however, the dataset I then obtain lists all the years vertically below each other which causes many duplicates.
        Code:
        use "file1", clear
        sort municipality_code year
        save "file1", replace
        use "file2", clear
        sort municipality_code year
        merge municipality_code year using "file1"
        drop _merge
        Am hoping that this might provide you with some more insights on the data. Thanks a lot for the help!

        Comment


        • #5
          The following example code may start you in a useful direction.

          Code:
          clear all
          cls
          // create example datasets
          
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str2 code int year str13 x
          "X" 2015 "Number"      
          "X" 2016 "Number"      
          "X" 2017 "Number, etc."
          "Y" 2015 "Number"      
          "Y" 2016 "Number"      
          "Y" 2017 "Number, etc."
          end
          save "file1", replace
          
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str2 code int year str36 y str11 z
          "X" 2015 "Average value" "Jobs"      
          "X" 2016 "Average value" "Jobs"      
          "X" 2017 "Average value" "Jobs, etc."
          "Y" 2015 "Average value" "Jobs"      
          "Y" 2016 "Average value" "Jobs"      
          "Y" 2017 "Average value" "Jobs, etc."
          end
          save "file2", replace
          
          // check that identifiers are distinct in each dataset
          
          use file1, clear
          duplicates report code year
          use file2, clear
          duplicates report code year
          
          // merge two datasets
          
          use file1, clear
          merge 1:1 code year using "file2"
          sort code year
          list, sepby(code)
          Code:
          . // check that identifiers are distinct in each dataset
          . 
          . use file1, clear
          
          . duplicates report code year
          
          Duplicates in terms of code year
          
          --------------------------------------
             Copies | Observations       Surplus
          ----------+---------------------------
                  1 |            6             0
          --------------------------------------
          
          . use file2, clear
          
          . duplicates report code year
          
          Duplicates in terms of code year
          
          --------------------------------------
             Copies | Observations       Surplus
          ----------+---------------------------
                  1 |            6             0
          --------------------------------------
          
          . 
          . // merge two datasets
          . 
          . use file1, clear
          
          . merge 1:1 code year using "file2"
          
              Result                      Number of obs
              -----------------------------------------
              Not matched                             0
              Matched                                 6  (_merge==3)
              -----------------------------------------
          
          . sort code year
          
          . list, sepby(code)
          
               +-----------------------------------------------------------------------+
               | code   year              x               y            z        _merge |
               |-----------------------------------------------------------------------|
            1. |    X   2015         Number   Average value         Jobs   Matched (3) |
            2. |    X   2016         Number   Average value         Jobs   Matched (3) |
            3. |    X   2017   Number, etc.   Average value   Jobs, etc.   Matched (3) |
               |-----------------------------------------------------------------------|
            4. |    Y   2015         Number   Average value         Jobs   Matched (3) |
            5. |    Y   2016         Number   Average value         Jobs   Matched (3) |
            6. |    Y   2017   Number, etc.   Average value   Jobs, etc.   Matched (3) |
               +-----------------------------------------------------------------------+
          
          .

          Comment

          Working...
          X