Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with merging dataset


    We have tried to merge the datasets we've obtained from CDS, Compustat and CRSP and we've run into some trouble.

    First, I would like to ask if we did the right thing by
    1: obtaining the 6-digit CUSIP values from CDS
    2: using the CUSIPs to obtain Compustat data
    3: using the gvkey values from the Compustat data to obtain permnos using the link function of the Compustat-CRSP merged database
    4: using the permnos to obtain CRSP data

    We decided to merge the datasets we obtained in reverse order, starting with the CRSP data, but when we came to the CDS data, we expected to see the monthly data expand so that there would be monthly data for every company for every event date. This went wrong and for example with the first firm in the dataset, there was only one observation that matched the first event date and only one for the second event date, but all other observations were matched with a third event date. Maybe using an m:m was a bad idea in this case. Could you help us figure out what to do to fix our problem?

  • #2
    Maybe using an m:m was a bad idea in this case.
    Yes. m:m merging is almost always a bad idea. Usually when people are tempted to do m:m merges they either are overlooking important aspects of the structure of their data, or they need to use some other command altogether.

    That said, for those of us who don't work in finance, your description is quite confusing. Show us small, representative samples of the data sets you want to merge together, along with a display of how the final result from those samples should look and probably somebody can help you.

    Anticipatory guidance: do not show the data by attaching spreadsheets or posting screenshots. Open actual Stata data sets and use the -dataex- command to generate code that can be pasted into the Forum so that those who might want to try working on this can quickly, accurately, and easily replicate your example data. If you do not already have the -dataex- command installed (it is user written by Robert Picard and Nick Cox), run -ssc install dataex-, and then follow the simple instructions to be found in -help dataex-.

    Comment


    • #3
      SDC
      Repurchase observation -> open market repurchase announcement (event days)
      We need CUSIP(8 digit) (we found only 6digit) PERMNO(not there) Firm name

      CRSP
      Market price
      Dialy closing price
      Holding period returns
      Returns on equally weighted CRSP intex
      Risk free rate
      Share Price
      Number of shares outstanding
      Momentum
      Liquidity
      gvkey

      CRSP COMPOSTAT MERGED
      PERMNO

      Compustat
      Total Assets
      Intangibles
      Liabilities

      We had to download all this data from different dataset, so the next step was to merge them all in a master file. In order to do this , first we merged all our monthly data, which seemed just fine.

      However, our companies with their own company IDs had to be matched with several observations per company ID, so that one company ID would match with several observations in our final dataset.
      As these would have to be matched by company ID, because of the different observations per company in both datafiles so that we would end up with a dataset with the size of A*B, where A and B are observations in each respective file.

      I think this was our main problem, but as we're competely new to STATA, here's roughly what we did: We obtained the CUSIPs from SDC/Thomson One, then expanded these CUSIPs in order to obtain data from Compustat. Compustat gave us gvkey codes that we used in CRSP-Compustat-merged to finally obtain the permnos that we used in CRSP. We started merging our data in reverse order (We merged the CRSP data with the GVKEY data, merged that with the Compustat data and finally merged that with the SDC/Thomson One data) and we think we managed to merge the monthly dataset until we hit the following snag: (see picture)
      The problem as you see from the highlighted area in the picture, after we attempted to merge our event data (SDC) and our monthly data in an m:m manner, we did not get a dataset with the size of the event data multiplied by the amount of months (more or less A*B), but instead, only one of the monthly observations was matched to the event date, leaving us with a dataset with size B (the monthly data). Instead of using code to handle these merges, we have used the drop-down menus to reduce the chance of encountering unexpected errors that would slow us down drastically.

      So, the problem is that we were unable to match our event dates to the monthly data with the same company code.
      Here are the commands
      to create string gvkey* rename gvkey num_gvkey tostring num_gvkey, generate(gvkey) *to put zero's infront of the gvkey* replace gvkey = "0" + gvkey if strlen(gvkey) == 4 replace gvkey = "0" + gvkey if strlen(gvkey) == 5 *string mofd* book value file tostring mofd, generate(str_mofd) *create a new identifier* generate gvkey_monthly = gvkey + "+" + str_mofd *string mofd* stoc price file tostring mofd, generate(str_mofd) *create a new identifier* generate gvkey_monthly = gvkey + "+" + str_mofd drop _merge merged again keep if _merge==3 *Create new identifier* stockprice file generate ncusip6 = substr(ncusip, 1,6) *event data file* rename cusip6 ncusip6 *stock price file* drop _merge keep if _merge==3




      Comment


      • #4
        This is the picture

        Comment


        • #5
          As these would have to be matched by company ID, because of the different observations per company in both datafiles so that we would end up with a dataset with the size of A*B, where A and B are observations in each respective file.
          If what you described in this quotation was the goal of your merge m:m then it is possible that what you wanted was a joinby command instead.

          With that said, if you need further advice, Statalist can better help you if you review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. See especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using CODE delimiters, as described in section 12 of the FAQ. Posting a screenshot and unreadable code, ignoring the well-meant advice given in post #2 by Clyde, one of the most prolific and knowledgeable participants in Statalist, does not encourage those of us who are most likely to be able to help you.

          Comment


          • #6
            Yes you are right, we are sorry..We spent a little time to find out how to use the commant and which files to send you. The problem is that when we try to have the dataex we have too many data and we can not sent you 1000 cariable from each file. Do you know how we can select just one-two firms to send in order to understant the problem and have a small sample?

            So first we tried to merge the file GVKEY with STOCK PRICE, we used the variable PERMNO in order to merge them, we used from the merge faction in the menu DATA->COMPINE DATASETS->MERGE TWO DATASETS->MANY TO MANY, key variable PERMNO

            Then, in order to merge our new master file with BOOK VALUE we create a monthly GVKEY as an identifier:
            and file where we rename lpermno in permno


            Because in the resulting master file, the gvkey values are all 6-digit strings and in BOOK TO VALUE file the gvkeys are numeric and have variable lengths, ranging from 4 to 6 digits, we adjusted the values in BOOK TO VALUE using:
            Code:
              
             *to turn the gvkey values into strings* rename gvkey num_gvkey tostring num_gvkey, generate(gvkey)    
             *to put zero's infront of the gvkey values* replace gvkey = "0" + gvkey if strlen(gvkey) == 4  replace gvkey = "0" + gvkey if strlen(gvkey) == 5
            In order to merge the files properly, we added a monthly identifier to the company identifier (gvkey) in both BOOK TO VALUE and our master file by combining the gvkey string and the mofd string (monthly identifier):
            Code:
            *turn mofd (monthly identifier) into strings* tostring mofd, generate(str_mofd)  
              *create a new identifier* generate gvkey_monthly = gvkey + "+" + str_mofd  
             drop _merge
            After, we merge the master file with the BOOK TO VALUE file by using: DATA->COMPINE DATASETS->MERGE TWO DATASETS->MANY TO MANY, key variable gvkey_monthly
            Code:
            keep if _merge==3
            Finally, we wanted to merge the master file with the EVENT DAY file In EVENT DAY file the cusip6 variable had a length of 6, which we wanted to use as the key variable to merge it with BOOK VALUE file, but in the master file the name was ncusip and had a length of 8 digits, so first we delete the last 2 digit to turn it into a 6 digit variable and then in order to merge with the event file we rename it also to ncusip6 in our master:
            Code:
              
              *Create new identifier* stockprice file generate ncusip6 = substr(ncusip, 1,6)  *event data file* rename cusip6 ncusip6
            After, we merge the master file with the EVENT DAY file by using: DATA->COMPINE DATASETS->MERGE TWO DATASETS->MANY TO MANY, key variable gvkey_monthly
            Code:
              
              *master file*  drop _merge  keep if _merge==3
            Last edited by EIRINI FANOURIA; 20 Feb 2016, 08:19.

            Comment


            • #7
              SO STOCK PRICE file
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input float mofd str8 ncusip double permno
              1 "14979210" 10009
              2 "14979210" 10009
              3 "46334710" 10009
              4 "46334710" 10009
              5 "46334710" 10009
              6 "46334710" 10009
              1 "12709510" 10010
              2 "12709510" 10010
              3 "12709510" 10010
              4 "12709510" 10010
              5 "12709510" 10010
              6 "12709510" 10010
              1 "49307510" 10020
              2 "49307510" 10020
              3 "49307510" 10020
              4 "49307510" 10020
              5 "49307510" 10020
              6 "49307510" 10020
              1 "46603210" 10026
              2 "46603210" 10026
              3 "46603210" 10026
              4 "46603210" 10026
              5 "46603210" 10026
              6 "46603210" 10026
              1 "38209110" 10035
              2 "38209110" 10035
              3 "38209110" 10035
              4 "38209110" 10035
              5 "38209110" 10035
              6 "38209110" 10035
              1 "03267210" 10043
              2 "03267210" 10043
              3 "03267210" 10043
              4 "03267210" 10043
              5 "03267210" 10043
              6 "03267210" 10043
              1 "15650310" 10048
              2 "15650310" 10048
              3 "15650310" 10048
              4 "15650310" 10048
              5 "15650310" 10048
              6 "15650310" 10048
              1 "65163310" 10056
              2 "65163310" 10056
              3 "65163310" 10056
              4 "65163310" 10056
              5 "65163310" 10056
              6 "65163310" 10056
              1 "01390210" 10071
              2 "01390210" 10071
              3 "01390210" 10071
              4 "01390210" 10071
              5 "01390210" 10071
              6 "01390210" 10071
              1 "07465510" 10087
              2 "07465510" 10087
              3 "07465510" 10087
              4 "07465510" 10087
              5 "07465510" 10087
              6 "07465510" 10087
              end

              GVKEY
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double lpermno str6 gvkey
              10035 "011974"
              10048 "012102"
              10048 "012102"
              10071 "012261"
              10043 "012307"
              10010 "012622"
              10026 "012825"
              10020 "013073"
              10087 "013119"
              10009 "016739"
              10056 "017081"
              end
              BOOK VALUE file

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long gvkey float mofd
              11974 3
              11974 6
              12102 2
              12102 5
              12261 3
              12261 6
              12307 3
              12307 6
              12622 1
              12622 4
              12825 3
              12825 6
              13073 3
              13073 6
              13119 2
              13119 5
              16739 3
              16739 6
              17081 3
              17081 6
              11974 4
              11974 5
              12102 3
              12102 4
              12102 6
              12261 4
              12261 5
              12307 4
              12307 5
              12622 2
              12622 3
              12622 5
              12622 6
              12825 4
              12825 5
              13073 4
              13073 5
              13119 3
              13119 4
              13119 6
              16739 4
              16739 5
              17081 4
              17081 5
              end
              EVENT FILE
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str6 cusip6 float eventdate2
              "466032" 14598
              "463347" 14545
              "466032" 13109
              "466032" 12563
              "127095" 12045
              "466032" 11968
              "466032" 11857
              "493075" 11716
              "651633" 14979
              "382091" 13760
              "156503" 13739
              "032672" 13664
              "032672" 13233
              "382091" 13177
              "074655" 13010
              "013902" 12170
              "156503" 11994
              end
              format %td eventdate2

              Comment


              • #8
                OK, this is helpful, but I have a few more questions.

                What is the relationship between ncusip and cusip6? It looks like the cusip6 is just the first six characters of ncusip. Is that correct?

                Is lpermno the same thing as permno? If not, how, if at all, are they related?

                It appears as if there is a 1-1 correspondence among cusip6, permno, and gvkey--that these are just different identifier systems for firms and that in principle knowing any one of them you can look up the others. Is that correct?

                Finally, do the observations of the data set you are trying to create consist of: cusip6 permno gvkey eventdate and mofd, and there are potentially several observations for each firm?
                Last edited by Clyde Schechter; 20 Feb 2016, 09:35. Reason: Correct grammatical error.

                Comment


                • #9
                  I think some more clarification beyond what I asked in #8 is needed. Experimenting with the data, I see that the answer to my third question in #8 is no: the same cusip6 can correspond to multiple permno values.

                  But more important, looking at the event file and the stock price file, it is evident that every cusip can potentially have both multiple event dates and multiple mofd values. So when you put this all together, how do you decide which event date(s) to match up with which mofd value(s) in the file you wish to create. I don't see how to use gvkey or permno to decide this question. So what is the rule for association?

                  Comment


                  • #10
                    What is the relationship between ncusip and cusip6? It looks like the cusip6 is just the first six characters of ncusip. Is that correct?

                    Yes, ncusip6 as found in the stock price file is just the first 6 digits of ncusip, but the cusip6 in the event data file was obtained from a different dataset, so that's why we were merging them using matching values in these variables.

                    Is lpermno the same thing as permno? If not, how, if at all, are they related?

                    They are the same, sorry for the confusion

                    Finally, do the observations of the data set you are trying to create consist of: cusip6 permno gvkey eventdate and mofd, and there are potentially several observations for each firm?

                    Cusip6, permno and gvkey are identifying variables for the company's stock. mofd is a time series (monthly) variable for every company, so each firm will correspond to many mofd values. Finally, eventdate corresponds to events that occurred to a specific firm, where multiple events may occur to the same company.

                    But more important, looking at the event file and the stock price file, it is evident that every cusip can potentially have both multiple event dates and multiple mofd values. So when you put this all together, how do you decide which event date(s) to match up with which mofd value(s) in the file you wish to create. I don't see how to use gvkey or permno to decide this question. So what is the rule for association?

                    The rule of association is that one set of months (mofd values) should correspond to one event date, so that if, for example, two events occur in one firm, the corresponding firm should have two sets of months (mofd values), one set for each event.

                    Thank you so much for helping us out!

                    Comment


                    • #11
                      Well, we're getting there, but I still have two problems:

                      The rule of association is that one set of months (mofd values) should correspond to one event date, so that if, for example, two events occur in one firm, the corresponding firm should have two sets of months (mofd values), one set for each event.
                      OK, but which mofd value goes with which event date? Is it mofd = 1 for the earliest event date, mofd = 2 for the next, etc.? Or is there some other rule? Along these lines, when I merge together the stock price file, the gvkey file and the book value file ignoring cusip6 (see next question below), each gvkey/permno is associated with exactly 6 mofd values. But in the event file , most of the cusip's have only one associated event, a few have 2, and the largest number of events for any cusip is 5 (which only occurs with one cusip). So regardless of how the mofd's are matched up with event dates, there will be excess mofd's unless there is some rule to assign multiple mofd's to the same date? Is there? If not, what do we do with the "excess" mofd's?

                      Cusip6, permno and gvkey are identifying variables for the company's stock.
                      Except that in your data, it isn't so. In your stock price file, permno 10009 sometimes corresponds to cusip6 149792 and sometimes to 463347. Is that a data error? If not, is there some rule to select the appropriate cusip from among the options for each permno?
                      Last edited by Clyde Schechter; 20 Feb 2016, 11:20. Reason: Correct grammatical error.

                      Comment


                      • #12
                        OK, but which mofd value goes with which event date? Is it mofd = 1 for the earliest event date, mofd = 2 for the next, etc.? Or is there some other rule? Along these lines, when I merge together the stock price file, the gvkey file and the book value file ignoring cusip6 (see next question below), each gvkey/permno is associated with exactly 6 mofd values. But in the event file , most of the cusip's have only one associated event, a few have 2, and the largest number of events for any cusip is 5 (which only occurs with one cusip). So regardless of how the mofd's are matched up with event dates, there will be excess mofd's unless there is some rule to assign multiple mofd's to the same date? Is there? If not, what do we do with the "excess" mofd's?
                        The mofd's are the monthly indicators that correspond to other variables such as stock prices, returns etc. After we've merged our data, we will compare the regular mofd's to the event mofd's and focus on the months where they match. In our full dataset, every stock has a variable amount of mofd's belonging to it, up to 144, so we had to make a small selection of 6 months in order to obtain a small, but functional sample.

                        Except that in your data, it isn't so. In your stock price
                        file, permno 10009 sometimes corresponds to cusip6 149792 and sometimes to 463347. Is that a data error? If not, is there some rule to select the appropriate cusip from among the options for each permno?


                        We notice a change in company name in this case, so there's a good chance that it's a data error. Thank you for noticing! As for selecting the appropriate cusip in order to remove further possible data errors systematically, we're not entirely sure.

                        Comment


                        • #13
                          OK, that's helpful. But a key question remains unanswered. The event file contains only cusip and event dates. You say you will match the regular mofd's to the event mofd's, but there is no mofd information in the event file. So how do you do this. Is there some rule that mofd = 1 corresponds to January 1992, mofd = 2 to February 1992, etc.or something like that?

                          Comment


                          • #14
                            Originally posted by Clyde Schechter View Post
                            OK, that's helpful. But a key question remains unanswered. The event file contains only cusip and event dates. You say you will match the regular mofd's to the event mofd's, but there is no mofd information in the event file. So how do you do this. Is there some rule that mofd = 1 corresponds to January 1992, mofd = 2 to February 1992, etc.or something like that?
                            Yes, mofd is generated by using the month, so mofd = 1 will always correspond to January 1990, mofd = 2 to February 1990 and so on. The event dates have been converted to mofd's as well, so event mofd = 1 will also correspond to january 1990

                            Comment


                            • #15
                              OK. I think we've got it. There are two basic steps. The first is to create a crosswalk between cusip6, gvkey, and permno as these are the different identifiers used in the various files. Then, we merge together the various source files, where the merge key is not just cusip6 or gvkey or permno, but the combination of cusip6/gvkey/permno and mofd.

                              A certain amount of cleanup of the example files was needed to do this: the presence of 2 cusips for the same permno in the stock price file is a problem. The original poster says he doesn't know how to resolve it. So I have resolved it arbitrarily by keeping only the numerically first cusip. Presumably the original poster will look into the matter further and can modify that decision if it isn't appropriate. In addition, the eventfile has daily dates, but those need to be converted to the mofd coding that starts in January 1990 so they could be linked up with the other files.

                              So here is my proposed solution:
                              Code:
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input float mofd str8 ncusip double permno
                              1 "14979210" 10009
                              2 "14979210" 10009
                              3 "46334710" 10009
                              4 "46334710" 10009
                              5 "46334710" 10009
                              6 "46334710" 10009
                              1 "12709510" 10010
                              2 "12709510" 10010
                              3 "12709510" 10010
                              4 "12709510" 10010
                              5 "12709510" 10010
                              6 "12709510" 10010
                              1 "49307510" 10020
                              2 "49307510" 10020
                              3 "49307510" 10020
                              4 "49307510" 10020
                              5 "49307510" 10020
                              6 "49307510" 10020
                              1 "46603210" 10026
                              2 "46603210" 10026
                              3 "46603210" 10026
                              4 "46603210" 10026
                              5 "46603210" 10026
                              6 "46603210" 10026
                              1 "38209110" 10035
                              2 "38209110" 10035
                              3 "38209110" 10035
                              4 "38209110" 10035
                              5 "38209110" 10035
                              6 "38209110" 10035
                              1 "03267210" 10043
                              2 "03267210" 10043
                              3 "03267210" 10043
                              4 "03267210" 10043
                              5 "03267210" 10043
                              6 "03267210" 10043
                              1 "15650310" 10048
                              2 "15650310" 10048
                              3 "15650310" 10048
                              4 "15650310" 10048
                              5 "15650310" 10048
                              6 "15650310" 10048
                              1 "65163310" 10056
                              2 "65163310" 10056
                              3 "65163310" 10056
                              4 "65163310" 10056
                              5 "65163310" 10056
                              6 "65163310" 10056
                              1 "01390210" 10071
                              2 "01390210" 10071
                              3 "01390210" 10071
                              4 "01390210" 10071
                              5 "01390210" 10071
                              6 "01390210" 10071
                              1 "07465510" 10087
                              2 "07465510" 10087
                              3 "07465510" 10087
                              4 "07465510" 10087
                              5 "07465510" 10087
                              6 "07465510" 10087
                              end
                              gen cusip6 = substr(ncusip, 1, 6)
                              // CLEAN UP DUPLICATE CUSIPS WITHIN PERMNO
                              // BY ARBITRARILY CHOOSING FIRST CUSIP
                              by permno (cusip6), sort: replace cusip6 = cusip6[1]
                              tempfile stock_price_file
                              save `stock_price_file'
                              
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input double lpermno str6 gvkey
                              10035 "011974"
                              10048 "012102"
                              10048 "012102"
                              10071 "012261"
                              10043 "012307"
                              10010 "012622"
                              10026 "012825"
                              10020 "013073"
                              10087 "013119"
                              10009 "016739"
                              10056 "017081"
                              end
                              rename lpermno permno
                              tempfile gvkey_file
                              save `gvkey_file'
                              
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input long gvkey float mofd
                              11974 3
                              11974 6
                              12102 2
                              12102 5
                              12261 3
                              12261 6
                              12307 3
                              12307 6
                              12622 1
                              12622 4
                              12825 3
                              12825 6
                              13073 3
                              13073 6
                              13119 2
                              13119 5
                              16739 3
                              16739 6
                              17081 3
                              17081 6
                              11974 4
                              11974 5
                              12102 3
                              12102 4
                              12102 6
                              12261 4
                              12261 5
                              12307 4
                              12307 5
                              12622 2
                              12622 3
                              12622 5
                              12622 6
                              12825 4
                              12825 5
                              13073 4
                              13073 5
                              13119 3
                              13119 4
                              13119 6
                              16739 4
                              16739 5
                              17081 4
                              17081 5
                              end
                              tostring gvkey, format(%06.0f) replace
                              isid gvkey mofd
                              tempfile book_value_file
                              save `book_value_file'
                              
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input str6 cusip6 float eventdate2
                              "466032" 14598
                              "463347" 14545
                              "466032" 13109
                              "466032" 12563
                              "127095" 12045
                              "466032" 11968
                              "466032" 11857
                              "493075" 11716
                              "651633" 14979
                              "382091" 13760
                              "156503" 13739
                              "032672" 13664
                              "032672" 13233
                              "382091" 13177
                              "074655" 13010
                              "013902" 12170
                              "156503" 11994
                              end
                              format %td eventdate2
                              // MAP EVENT DATES TO "mofd" WITH mofd = 1
                              // CORRESPONDING TO January 1990
                              gen mofd = mofd(eventdate2)-tm(1989m12)
                              isid cusip6 mofd
                              tempfile event_file
                              save `event_file'
                              
                              // CREATE A CROSSWALK BETWEEN
                              // CUSIP, GVKEY, AND PERMNO
                              
                              use `stock_price_file', clear
                              keep cusip6 permno
                              duplicates drop
                              isid cusip6
                              isid permno
                              tempfile crosswalk
                              save `crosswalk'
                              use `gvkey_file', clear
                              duplicates drop
                              isid permno
                              isid gvkey
                              merge 1:1 permno using `crosswalk', assert(match) nogenerate
                              save `"`crosswalk'"', replace
                              
                              // NOW MERGE IN THE STOCK PRICE DATA
                              merge 1:m cusip6 using `stock_price_file'
                              drop _merge
                              isid cusip6 mofd
                              
                              // AND NOW MERGE IN THE BOOK VALUE DATA
                              merge 1:1 gvkey mofd using `book_value_file'
                              drop _merge
                              isid cusip6 mofd
                              
                              // AND FINALLY MERGE IN THE EVENT DATA
                              merge 1:1 cusip6 mofd using `event_file'
                              drop _merge
                              Notes:

                              1. In creating the crosswalk file, the merge comes with -assert(match)- because if there are cusip's, gvkey's, or permno's that do not have a corresponding values we can't really proceed.

                              2. By contrast, when merging in the substantive data on stock prices, book values, and events, it is anticipatable that there may be some unmatchable data. That is how my epidemiologic real world works, and I assume the same can happen in finance. And even in the example data provided, there are some events that do not match because their dates are out of range of the dates in the other files. So no -assert(match)- option specified for these merges.

                              3. In the original poster's description of the data sets, the stock price and book value are regular monthly data, so I have assumed (and verified in the code) that each firm has only a single observation in any given month. This will need to be true in the original data, not just to pass the -isid-s, but for the whole process to work coherently. If it is not true in the real data, then decisions need to be made about which value of a given firm's stock price/book value in a given month to use (or whether to average them, or whatever), but a single value is needed.

                              4. On the other hand, although in the example data each firm experiences at most a single event in a given month, I do not see anything in the original description that guarantees this will always be true. I have relied on that single event per month restriction by using -merge 1:1- in the final step where the event data is merged in. However, if a firm can in fact experience more than one event per month, the 1:1 can be changed to 1:m and nothing terrible will happen.

                              5. I have used tempfiles to hold all of the example data sets and the crosswalk file created. The few data-transformation/cleanup steps incorporated in those parts of the code need to be followed in the original data, but there is no reason they cannot be saved as real data sets assuming they are useful for other purposes. Probably saving the crosswalk as a real data file would make sense also: I imagine it will prove useful in the future.

                              Comment

                              Working...
                              X