Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining two datasets: different results each time

    Hi everyone,

    I've been searching a lot around the internet but I haven't found how to solve my problem.

    I have two datasets. Each dataset has a list of companies with balance sheet results for a given year. Each company is also part of a certain industry, which is identified with a "SIC" code. I need to combine both datasets together.

    Hence, I've created an ID code (called sicyear), which combines the SIC code with the year in question. For instance if a company has a SIC code that equals 8000 and the year in question is 2006, then the ID code would be 80002006. These ID codes are not unique.

    My problem is that when I run my to do file I always get a different number of arrays for the combined data. And I just can't understand why.

    Here are the five first lines from my first dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 gvkey double fyear str10 cusip str58 conm double(ceq csho prcc_f) str4 sic float(MV_equity BtMkt startyear) str8 sicyear
    "014790" 1998 "59001A102" "MERITAGE HOMES CORP"            72.279    5.335     12.1875 "1531"  65.02031  1.1116372 1998 "15311998"
    "142953" 2001 "609207105" "MONDELEZ INTERNATIONAL INC"      23478     1735       34.03 "2000"  59042.05   .3976488 2001 "20002001"
    "029942" 2004 "751028101" "RALCORP HOLDINGS INC"            444.2   29.389        36.1 "2000" 1060.9429   .4186842 2004 "20002004"
    "009777" 2003 "832696405" "SMUCKER (JM) CO"              1210.693   50.175        52.3 "2033" 2624.1526   .4613653 2003 "20332003"
    "063447" 2000 "401617105" "GUESS INC"                     175.156   43.563      5.3125 "2330" 231.42844   .7568473 2000 "23302000"
    Here are the five first lines from my second dataset:

    Code:
    * Example generated by -dataex-. To install:    ssc install    dataex
    clear
    input str6 gvkey2 double fyear2 str10 cusip2    str58 conm2    double(ceq2    csho2 prcc_f2)    str4 sic2 float(MV_equity2 BtMkt2)    str8 sicyear
    "010802" 1980 "902740109" "UI GROUP INC"        48.954    2.236    14.625 "0100"      32.7015    1.4969956    "01001980"
    "001266" 1981 "016230104" "ALICO INC"        15.709    1.887    48.5 "0100"      91.5195    .17164648    "01001981"
    "002099" 1981 "073799108" "BEAR CREEK CORP"        28.162    1.9    15.25 "0100"       28.975    .9719414    "01001981"
    "002812" 1981 "256603101" "DOLE FOOD CO INC"        442.415    25.534    12.74999683 "0100"     325.5584    1.358942    "01001981"
    "005416" 1981 "404294100" "HS GROUP INC"        3.819    .293    11.5 "0100"       3.3695    1.1334026    "01001981"
    Does anyone have any idea why my results are always different when I rerun the do file?

    Thank you very much for any help you can provide!

    Aurele

  • #2
    the FAQ requests that you show us exactly what you typed - without seeing your do file, the best we can do is guess; so, please post the do file (within code blocks)

    Comment


    • #3
      Hello Aurele,

      Rich provided the best advice so as to get insightful replies.

      That said, I wonder whether you really chose - sort - before merging or appending so as you could sort each dataset the same way.

      Best,

      Marcos
      Best regards,

      Marcos

      Comment


      • #4
        As both Marcos and Rich Goldstein point out, you are asking us to troubleshoot some code that we can only imagine, because you haven't shown it.

        That said, just looking at your data set and doing some speculative reading between the lines of your description, it sounds like you are trying to merge the data keying on the variable sicyear. The problem with that is that sicyear is not going to uniquely identify observations in either of your data sets. So if you are using -merge- you are probably tempted into using -merge m:m-, which, unfortunately, is almost guaranteed to produce irreproducible gibberish results. Even if that were not the case, would you really want to make a data set that, in the same observation, contains data from one firm in a given year for some variables, and data from a different firm in a given year for the other variables, the only constraint being that the two firms are part of the same industry. Admittedly there are some circumstances where something like this makes sense*, but they are uncommon and I'm guessing you are not in one of those situations.

        Much more typical is the situation where you want to combine the data from file 1 with data from the same firms in the same years in file 2. If that's what you want, then you don't need your sicyear variable for anything, you can drop it. But what you do need is to have the same names for the firm id and the year in both. So something like this:

        Code:
        use data_set_1, clear
        rename gvkey1 gvkey2
        rename fyear1 fyear2
        merge 1:1 gvkey2 fyear2 using data_set_2
        *The "something like this" that sometimes is useful is to pair each firm-year's variables in the first data set with every firm-year's variables representing the same industry. If that's what you mean by combining the data sets, then the -merge- command is the wrong command, you need -joinby-. for that.

        Comment


        • #5
          Hi everyone,

          Thank you for very much for your replies. Sorry for not posting my code. I will do so in this post. Hopefully, I will be clearer.

          I do indeed want a data set that, in the same observation, contains data from one firm in a given year for some variables, and data from a different firm from that same year and from the same industry. This will allow me to compare the balance sheet of the first firm to all the other firms (of the same industry for the same year) and hence assess which of the other firms resembles most the first firm.

          I've simplified my first data set:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str6 gvkey double fyear str10 cusip str58 conm double(ceq csho prcc_f) str4 sic float(MV_equity BtMkt startyear) str8 sicyear
          "064486" 1999 "402040109" "GUITAR CENTER INC"           80.319 22.023 10.0625 "5700" 221.60643  .3624398 1999 "57001999"
          "165914" 2009 "169656105" "CHIPOTLE MEXICAN GRILL INC" 703.461 31.483   88.16 "5812"  2775.541 .25345004 2009 "58122009"
          end
          It now only has two arrays and the "sicyear" ID code is unique.

          The second data set still looks the same (I'm showing only the first four array but there are many more):

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str6 gvkey2 double fyear2 str10 cusip2 str58 conm2 double(ceq2 csho2 prcc_f2) str4 sic2 float(MV_equity2 BtMkt2) str8 sicyear
          "010802" 1980 "902740109" "UI GROUP INC"                      48.954     2.236             14.625 "0100"      32.7015      1.4969956 "01001980"
          "001266" 1981 "016230104" "ALICO INC"                         15.709     1.887               48.5 "0100"      91.5195      .17164648 "01001981"
          "002099" 1981 "073799108" "BEAR CREEK CORP"                   28.162       1.9              15.25 "0100"       28.975       .9719414 "01001981"
          "002812" 1981 "256603101" "DOLE FOOD CO INC"                 442.415    25.534        12.74999683 "0100"     325.5584       1.358942 "01001981"
          My code is as follows:

          Code:
          sort sicyear
          merge 1:m sicyear using "C:\Users\Desktop\Matching Companies\Company_data_edit.dta"
          The master data is the first data set I showed above in this post (the one with only two arrays). Even though the master data has unique ID codes, I get different results for the merging: most often I get 80 matches but sometimes I get only 77 matches. And I can't get why I'd get different results.

          Any suggestion? Thanks

          Aurele
          Last edited by Aurele Major; 17 Jan 2017, 22:31.

          Comment

          Working...
          X