Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets

    Hi,

    I'm currently writing my thesis and I'm new to Stata. I want to combine a dataset from Compustat and a dataset from Factset, both in yearly terms. I have the following datasets:

    Compustat
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long gvkey str12 isin str10 datadate int year float lvchange
    1166 "NL0000334118" "31/12/2002" 2002  .006297663
    1166 "NL0000334118" "31/12/2003" 2003     .105081
    1166 "NL0000334118" "31/12/2004" 2004   .02746665
    1166 "NL0000334118" "31/12/2005" 2005  -.01903224
    1166 "NL0000334118" "31/12/2006" 2006  -.02919521
    1166 "NL0000334118" "31/12/2007" 2007  -.05277575
    1166 "NL0000334118" "31/12/2008" 2008  -.05401874
    1166 "NL0000334118" "31/12/2009" 2009     .123919
    1166 "NL0000334118" "31/12/2010" 2010  -.07997546
    1166 "NL0000334118" "31/12/2011" 2011 -.072212696
    1166 "NL0000334118" "31/12/2012" 2012  -.09492537
    1166 "NL0000334118" "31/12/2013" 2013  -.23273465
    1166 "NL0000334118" "31/12/2014" 2014  .007800169
    1166 "NL0000334118" "31/12/2015" 2015 -.013378848
    1166 "NL0000334118" "31/12/2016" 2016  .000170374
    1932 "GB0002875804" "31/12/1995" 1995           .
    1932 "GB0002875804" "31/12/1996" 1996  .007074844
    1932 "GB0002875804" "31/12/1997" 1997   .02022037
    1932 "GB0002875804" "31/12/1998" 1998   .07025388
    1932 "GB0002875804" "31/12/1999" 1999  -.27306455
    end
    FactSet
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12 isin str61 name str12 ri int year byte month
    "AT0000730007" "ANDRITZ AG" "11.403597"  2002 12
    "AT0000730007" "ANDRITZ AG" "72.014046"  2003 12
    "AT0000730007" "ANDRITZ AG" "51.732697"  2004 12
    "AT0000730007" "ANDRITZ AG" "68.74873"   2005 12
    "AT0000730007" "ANDRITZ AG" "79.90005"   2006 12
    "AT0000730007" "ANDRITZ AG" "2.558291"   2007 12
    "AT0000730007" "ANDRITZ AG" "-54.92813"  2008 12
    "AT0000730007" "ANDRITZ AG" "133.72536"  2009 12
    "AT0000730007" "ANDRITZ AG" "73.54651"   2010 12
    "AT0000730007" "ANDRITZ AG" "-4.4100466" 2011 12
    "AT0000730007" "ANDRITZ AG" "56.09789"   2012 12
    "AT0000730007" "ANDRITZ AG" "-3.874004"  2013 12
    "AT0000730007" "ANDRITZ AG" "1.3872743"  2014 12
    "AT0000730007" "ANDRITZ AG" "0.37008524" 2015 12
    "AT0000730007" "ANDRITZ AG" "8.931768"   2016 12
    "AT0000743059" "OMV AG"     "3.7600994"  2002 12
    "AT0000743059" "OMV AG"     "30.54434"   2003 12
    "AT0000743059" "OMV AG"     "92.7689"    2004 12
    "AT0000743059" "OMV AG"     "127.135254" 2005 12
    "AT0000743059" "OMV AG"     "-11.443013" 2006 12
    end
    I would like to match both the isin and the year, so I get the lv change of each company for each year.

    To merge the, I tried the code "merge m:m isin year using fact, , keepusing(lvchange), but it gave me the error
    "variables isin year do not uniquely identify observations in the master data".

    Any help on how to fix this would be appreciated.

    Thanks

  • #2
    I have two comments about "merge m:m isin year using fact, , keepusing(lvchange)" The first is that it would never give you the error message "variables isin year do not uniquely identify observations in the master data." That's because an m:m merge would not care at all whether anything uniquely identifies observations in any data set. More important, you should not be using -merge m:m-. It produces a useless data salad. If you think you need to use -merge m:m- then one or more of three things is true: 1) you don't understand what -merge- does, 2) you don't understand your data, or 3) your data sets are messed up.

    In any case, based on the example data you show, you can properly combine these two data sets with:

    Code:
    use compustat_data_set, clear
    merge 1:1 isin year using factset_data_set
    because in your example data the combination of isin and year does uniquely identify observations in both data sets.

    That said, I see that the computstat data set also includes gvkey, and I suspect that in a fuller example from that data set, we would see multiple observations having the same values of isin and year, with different values of gvkey. So what I recommend you try is:
    Code:
    use compustat_data_set, clear
    merge m:1 isin year using factset_data_set
    Now, if you try that and you get an error message saying that isin and year do not uniquely identify observations in the using data set, then you need to figure out what is actually going on. The first question is whether the factset data set should have multiple observations with the same values of isin and year.

    Case 1: It should not. Then you need to find out why it does.
    Code:
    duplicates tag isin year, gen(flag)
    browse if flag
    will show them to you and then maybe you can figure out how those duplicates got there, and arrive at a way to correctly retain only the right ones, and fix whatever errors in data management led to the inclusion of the spurious observations.

    Case 2: It should. In that case, these data sets may not be mergeable at all. In order for them to be mergeable there must be some third variable (or perhaps more than one) common to both data sets that will, in combination with isin and year, uniquely identify observations in at least one of the two data sets. For example, if the factset data set also contains a gvkey variable, then including that in the -merge- command might resolve the problem. If there are, however, no such additional variables, then your data sets cannot be merged. Think about it. Suppose you have several observations in both data sets all with isin = AB123435678 and year = 2012. How would you decide on the proper way to pair them up? Which of these observations in the compustat data set goes with which one in the factset data set? If the answer is to create all possible pairings of such observations, -merge- is the wrong command: see -help joinby-. Or perhaps when looking more carefully at the data you can arrive at a rule for selecting specific pairs. In that case you might have to weed out unmatchable observations from one or both data sets so that you are then able to successfully use -merge m:1 isin year-. Or perhaps it is possible to create a new variable in each of the data sets that does correctly select those matches and you can -merge 1:1 isin year new_variable-.

    Comment


    • #3
      Hi,

      Thanks to you I found the problem, which was that some of the companies changed their annual reporting date. After that I used the code:
      Code:
      bysort gvkey (flag): drop if flag[_N]==1
      217 observations were deleted. I thought the problem was solved, but when I used both these codes I still got the error message "variables isin year do not uniquely identify observations in the using data":
      Code:
      merge 1:1 isin year using fact
      Code:
      merge m:1 isin year using fact
      By the way, the dataset should not have multiple observations with the same values of isin and year anymore.
      Last edited by Juan Gonzalex; 26 May 2022, 04:11.

      Comment

      Working...
      X