Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • m:m merge


    I may have found a use for m:m merging--I'm wondering if this is correct/if there is a simpler way of doing this.


    I have data on firms that made multiple different payments each year over a number of years, sometimes making no payments in a given year. There's about 800k observations, I'm using Stata 13.

    My data looks something like:


    Firm Year Payment
    1 2013 100
    1 2013 20
    1 2012 100
    2 2013 10
    2 2012 50
    3 2012 50
    4 2009 70
    4 2010 80
    ...


    I'm trying to generate variables that are yearly totals for each firm & that have that value for every observation for the firm, provided that it made a payment in that year.


    Firm Year Payment 2013total 2012total ...
    1 2013 100 120 100
    1 2013 20 120 100
    1 2012 100 120 100
    2 2013 10 10 50
    2 2012 50 10 50
    3 2012 50 . 50
    4 2009 70
    4 2010 80 . .
    ...


    Here's what I came up with:


    preserve

    keep if year==2013

    bysort Firm: egen total2013=total(payment)

    save "x.dta"

    restore

    merge m:m firm using "x.dta"

    save

    // rinse and repeat

    ...


    This seems to do the trick. Is this (finally) a necessary use of m:m? & Is there a simpler way to do this? This feels like the long way 'round.


    Thanks!



  • #2
    This is almost certainly not the way you want to do this. For a m:m merge to work the order of the observations with matching ID across data sets matters.
    A m:m merge will merge the first observation in the master data to the first matching observation in using data. It will continue to do this until it runs out of observations in one data set or the other. So in your situation you have your master data set that has a certain number of observations per firm. You'll have a using data set that has fewer observations per firm because it just has the 2013 data. So if you merge m:m you will attach the total2013 value to the first N observations for that firm, where N is the number of observations for that firm in 2013. The rest of the observations won't match.

    Comment


    • #3
      Aside from the many to many merge issues, this isn't a task that requires merging anything.

      There are lots of different ways to do this and probably some more elegant than this but here's one solution
      Code:
      **for the loop substitute in your actual year range
      forv y=2009/2013 {
          
          **create the total just within the observations of the relevant year
          bysort firm: egen total`y'x=total(payment) if year==`y'
          
          **now fill in the rest of the observations for that firm
          bysort firm: egen total`y'=max(total`y'x)
      
          **and get rid of the version of the variable that only appears in the appropriate year
          drop total`y'x
      }

      Comment


      • #4
        Has anybody ever come up with a good reason to do an m:m merge? Even the manual says "Because m:m merges are such a bad idea, we are not going to show you an example." Maybe if you want to use one Stata should force you to type in something like iREallYWanttOdoTh!s, getting the capitalization exactly right and all that.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam

        Comment


        • #5
          I can come up with examples where a m:m merge would work but I'm hard pressed to come up with an example where a m:m merge is actually a good idea. Basically you need two data sets that have multiple lines per observation that are always ordered in exactly the order you want to match them in. That setup will work with a m:m merge as long as you don't do anything else to the data before merging. However, I wouldn't ever use do a many to many merge with that type of data since the very first thing I would do upon opening such a dataset would be to create a variable recording the original observation order within the identifier to avoid screwing things up later by resorting. Then I'd just do a 1:1 merge using the original id and the created order variable.

          Comment


          • #6
            Thank you!! I knew there was a simpler way to do this, 'spreading' with the max() function works perfectly.

            For posterity's sake, m:m merging does work here--it produces the same results as Sarah's (better) method. The order of the matched observations doesn't matter since they're all getting the same value within each matched firm, and m:m will repeatedly use the last-used group observation if there are different numbers of observations.

            It is hard to come up with an example in which m:m is required--there's always a workaround.


            Comment


            • #7
              Interesting. For some reason I thought Stata didn't keep matching when it ran out of observations rather than just repeating the last-used observation.
              I thought I'd observed the phenomenon where the merge doesn't match the extra observations. I must be confusing this with what happens if you don't specify an id variable for the merge, in which case Stata does just matches one to one until it runs out of observations. I think I must have erroneously assumed the same thing happened within identifier in a many to many merge with different numbers of observations across the two sets.

              Comment


              • #8
                Sarah, that is exactly what I am getting with merge m:m. I can't see that sort of merge ever being useful in any situation. The solution is joinby

                As for many-to-many, there are numerous real world examples.

                T1 = Patient admissions
                T2 = Patient treatments (or diseases or consulted doctor)
                A treatment may take a number of admissions.
                An admission may be for a number of treatments.

                T1 = interventions
                T2 = participants
                An intervention may have multiple participants.
                A participant may take part in multiple interventions.

                There are ways to muddle around the maintenance of the many to many relationship, usually by summarising one of the tables into a field array in the other.
                Last edited by simon forsyth; 17 May 2016, 21:14.

                Comment

                Working...
                X