Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge with Update Replace

    Dear all,

    I use stata 13. I need do a merge between two datasets with common variables tsymbol cusip fyear. My objective is to replace missing values of cusip in the master(dataone) with the non-missing cusip in the using data (datatwo).
    ~~~~~~~~~~~~
    My code is:
    Use dataone
    Merge 1:1 tsymbol fyear using datatwo, update replace
    ~~~~~~~~~~~~
    One issue: Tsymbol datatype is str8 in the master(dataone) and str5 in the using data (datatwo).
    Question:
    1. Should I change the string length
    2. Should I revise my code to accomplish my goal?
    Below is pretend data (real data is much bigger)
    ~~~~~~~~~~~~
    dataone
    Input str8 Tsymbol fyear str9 cusip for_co
    A 2000 . 3
    A 2001 . 5
    A 2002 . 7
    A 2003 . 6
    A 2004 . 10
    AA 2000 1381710 1
    AA 2001 1381710 1
    AA 2004 1381710 1
    AAC.OLD 1995 . 1
    AACC 2005 . 2
    AACC 2007 . 3
    ~~~~~~~~~~~~
    ~~~~~~~~~~~~
    Datatwo
    Input str6 gvkey fyear str5 tysbol str9 cusip
    126554 1999 A 00846U101
    126554 2000 A 00846U101
    126554 2001 A 00846U101
    126554 2002 A 00846U101
    126554 2003 A 00846U101
    126554 2004 A 00846U101
    126554 2005 A 00846U101
    157058 2001 AACC 04543P100
    157058 2002 AACC 04543P100
    157058 2003 AACC 04543P100
    157058 2004 AACC 04543P100
    157058 2005 AACC 04543P100
    157058 2006 AACC 04543P100
    157058 2007 AACC 04543P100
    157058 2008 AACC 04543P100
    157058 2009 AACC 04543P100

    I also attached the data in excel.

    thanks in advance,

    Rochelle

    Attached Files

  • #2
    Rochelle,

    Merge will adjust the string length as necessary to accommodate the larger of the two strings. The only time you have to worry about such things is if one variable is string and the other is numeric.

    Regards,
    Joe

    Comment


    • #3
      So, a couple of things.

      First, you have a variable Tsymbol in dataone, but the nearest equivalent in datatwo is tysbol. For your merge to work as intended, the match variables must be identical in both data sets. (And variable names in Stata are case sensitive, the merge will only work if they agree exactly.)

      Second, why is Tsymbol 8 characters in dataone but only five characters in datatwo? If the variable in dataone is padded with blanks, and the one in datatwo is not, then the merge will not work as you expect. Probably you need to do something like replace Tsymbol = trim(itrim(Tsymbol)) in dataone. That one of the variables is typed str8 and the other str5 is not, itself, important, but to the extent that it suggests that the actual content of the variables will be different, it is a problem you need to fix before merging.

      Third, if you use both the -update- and -replace- options in your merge statement, you will clobber any pre-existing non-missing values of cusip in dataone. If you think some of those cusip values in dataone are incorrect and the ones in datatwo are appropriate corrections, then this is fine. But if you want to leave the existing values of cusip in dataone as they are, then you should use only the -update- option.

      Finally, a few general cautions about merging. I don't know if the variables you are showing us here represent the full dataset or only the ones you are focusing on. But if you are doing a merge with -update-, and all the more so if you are doing -replace-, you want to also add the -keepusing- option to your -merge- command so that you don't inadvertently replace values of other variables that the data sets have in common. (I know you said that tsymbol, cusip, and year are variables the data sets have in common, but you don't say that those are the only variables in common.) And, if datatwo does contain other variables that are in dataone, and if they should be the same in both data sets, then a safer merge would include those other variables in the -merge- statement's varlist. That way if there are discrepancies, you will find out about them.

      Comment


      • #4
        Rochelle,

        While my advice was technically correct, I wasn't paying attention to the fact that the merge variable was the one with two different string lengths. Clyde gives good advice about problems that could arise with such a merge. Aside from the spelling of "Tsymbol/tsybol" these problems will not show up as errors but may produce unexpected results.

        If you want more details on what the replace and update options are actually doing, you can check out vmerge (from SSC), which has the same syntax as merge (plus a few more options).

        Regards,
        Joe

        Comment


        • #5
          Many thanks to Joe and Clyde !!!

          some clarifications:
          1. tsybol was a typo for the use data. Both data should have Tsymbol and fyear
          2. Concerning Clyde's question about different length for Tsymbol variable: data one and two come from different data vendors, I am trying to match them to the best of my ability.
          3. my goal is to fill in the missing cusip in the master data by using the information from use data, I do not plan to replace non-missing cusip in the master.
          I ran the following code- with only update option , but I did not get cusip filled in for tsymbol ="A" for year 2000, 2001, 2002, 2003, 2004 . Could you let me know why?

          clear all

          * dataone
          input str9 tsymbol fyear str9 cusip for_co

          A 2000 . 3
          A 2001 . 5
          A 2002 . 7
          A 2003 . 6
          A 2004 . 10
          AA 2000 1381710 1
          AA 2001 1381710 1
          AA 2004 1381710 1
          AAC.OLD 1995 . 1
          AACC 2005 . 2
          AACC 2007 . 3


          end

          sort tsymbol fyear

          save mas,replace

          * datatwo
          clear all


          input str6 gvkey fyear str5 tsymbol str9 cusip
          126554 1999 A 00846U101
          126554 2000 A 00846U101
          126554 2001 A 00846U101
          126554 2002 A 00846U101
          126554 2003 A 00846U101
          126554 2004 A 00846U101
          126554 2005 A 00846U101
          157058 2001 AACC 04543P100
          157058 2002 AACC 04543P100
          157058 2003 AACC 04543P100
          157058 2004 AACC 04543P100
          157058 2005 AACC 04543P100
          157058 2006 AACC 04543P100
          157058 2007 AACC 04543P100
          157058 2008 AACC 04543P100
          157058 2009 AACC 04543P100

          end

          sort tsymbol fyear
          save use1,replace

          use mas, clear
          merge 1:1 tsymbol fyear using use1, update



          Best,
          -Rochelle

          Comment


          • #6
            Rochelle,

            Thanks for posting the exact code you used. The problem (with this example anyway) is that cusip is defined as a string, so the period(.) is an actual value not a missing value (and as such will not be updated from the using data set). One solution is to add replace cusip="" if cusip=="." before saving the first data set. A more robust solution would be to figure out how to input missing cusip values as a null string instead of a period. However, that will require a different approach to data input (either different delimiters or fixed column widths).

            Regards,
            Joe

            Comment


            • #7
              Thanks Joe for helping me . I appreciate it very much.

              Comment

              Working...
              X