Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining entries for multiple start and end dates under one company variable

    Hi everyone,

    I have S&P500 data that lists when a company (gvkey) entered the S&P500 (from) and exited it (thur or thru_x if it was still there when the data was downloaded). My goal is to create a series of variables for each year from 1964 to 2020 that are either on or off (1 or 0) if the company was in the S&P 500 that year.

    My big problem is that companies that left the S&P 500 and then returned to it have multiple entries (rather than just being one set of from and thru dates they have multiple entries and exits). Does anyone have any advice on how to approach this problem?

    I started off by creating a count of which gvkeys appear more than once (varibale: a), and then I was trying to combine them, but I ended up creating multiple from and thru variables and that got too messy.


    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long gvkey int(from thru thru_x) float a
    1010 1551 8947 8947 1
    1013 14458 17348 17348 1
    1040 1551 9404 9404 1
    1043 1551 5995 5995 1
    1045 20170 . 22113 2
    1045 1551 15777 15777 2
    1057 1551 3346 3346 1
    1062 3195 10804 10804 1
    1075 14521 . 22113 1
    1078 1551 . 22113 1
    1098 3012 11871 11871 1
    1133 1551 5264 5264 1
    1161 8947 19623 19623 2
    1161 20898 . 22113 2
    1164 13240 15474 15474 1
    1177 6025 21520 21520 1
    1192 1551 2799 2799 1
    1194 4929 14152 14152 1
    1209 9251 . 22113 1
    1215 1551 5537 5537 1
    1221 1551 8278 8278 1
    1230 20587 . 22113 1
    1239 4107 17121 17121 1
    1240 9100 16953 16953 1
    1243 1551 15542 15542 1
    1246 10286 14789 14789 1
    1253 10531 13528 13528 1
    1279 14955 18685 18685 2
    1279 1551 5995 5995 2
    1300 1551 . 22113 1
    1301 1551 9861 9861 1
    1302 1551 1977 1977 1
    1308 6390 10042 10042 1
    1318 12773 17488 17488 1
    1327 20159 . 22113 1
    1356 1551 20758 20758 1
    1359 11302 15150 15150 1
    1362 1551 8308 8308 1
    1365 1551 12369 12369 1
    1380 8917 . 22113 2
    1380 1551 5995 5995 2
    1387 5751 6908 6908 1
    1392 9800 13771 13771 1
    1394 1551 4899 4899 1
    1408 1551 19843 19843 1
    1409 3468 9496 9496 1
    1414 1551 10561 10561 1
    1420 1551 5903 5903 1
    1428 3195 3803 3803 1
    1429 1551 4807 4807 1
    1430 1551 12687 12687 1
    1440 1551 . 22113 1
    1447 6025 . 22113 1
    1449 14392 . 22113 1
    1465 6025 15216 15216 1
    1468 10896 16193 16193 1
    1478 1551 18185 18185 1
    1481 4656 9435 9435 1
    1485 8734 14528 14528 1
    1487 7395 . 22113 1
    1489 1551 5995 5995 1
    1517 1551 10073 10073 1
    1528 1551 9251 9251 1
    1560 1551 3165 3165 1
    1567 15473 17688 17688 2
    1567 1551 10317 10317 2
    1572 1551 7182 7182 1
    1573 7212 14418 14418 1
    1581 1551 16760 16760 1
    1596 2191 9039 9039 1
    1598 19624 . 22113 1
    1602 11689 . 22113 1
    1608 2737 14338 14338 1
    1609 1551 14247 14247 1
    1619 14313 17110 17110 1
    1620 6664 8796 8796 2
    1620 1551 4807 4807 2
    1623 1551 9616 9616 1
    1628 1551 6268 6268 1
    1632 14529 . 22113 1
    1651 9100 17075 17075 1
    1661 14901 20169 20169 1
    1663 6025 17854 17854 1
    1678 13723 . 22113 1
    1690 8369 . 22113 1
    1704 12858 . 22113 1
    1722 1551 3803 3803 2
    1722 7880 . 22113 2
    1738 1551 5995 5995 1
    1746 10623 13732 13732 1
    1753 1551 2160 2160 2
    1753 2281 3711 3711 2
    1755 1551 14206 14206 1
    1758 1551 4076 4076 1
    1762 1551 14930 14930 1
    1789 1551 14543 14543 1
    1794 10835 17849 17849 1
    1803 1551 9769 9769 1
    1809 1551 3438 3438 1
    1831 13977 14947 14947 1
    end
    format %tdnn/dd/CCYY from
    format %tdnn/dd/CCYY thru
    format %tdnn/dd/CCYY thru_x
    [/CODE]

  • #2
    Code:
    assert thru == thru_x if !missing(thru)
    assert from <= thru_x
    
    
    forvalues y = 1964/2020 {
        gen listed_`y' = inrange(`y', year(from), year(thru_x))
    }
    collapse (max) listed*, by(gvkey)
    Note: This code treats a firm as listed in a given year if it is listed on any date in that year.

    I'm not sure why you are doing this and what you plan to do with these variables, but the above code will create them for you. For most purposes, though, there are usually better ways to do things than creating a big list of indicators like this.

    Comment


    • #3
      Hi Clyde,

      This worked perfectly, thank you!

      I'm sure you're right that there is a more efficient/better way to do what I'm planning. I'm merging this dataset with another that pulls in company financial information (merging based on the gvkey). I'm then comparing the financial performance of a subsection of firms from 1964-2020 to the top quartile of S&P 500 companies every year, which is why I needed to know who was in the S&P 500 when, hence the binary date variable.

      Comment


      • #4
        Well, I'm going out a bit on a limb here, having not seen the other data set and not sure in what way you plan to do that comparison, but my hunch is that you should take the results of the code in #2 and turn them into a long data set:
        Code:
        reshape long listed_, i(gvkey) j(year)
        I'm going to guess that this other financial data set also has data year by year for these firms. If it is organized with gvkey and year as separate variables, then you can -merge 1:1 gvkey year- to combine the data sets. If your financial data set is in wide layout (gvkey is a variable, but information for each year is in a separate variable, then -reshape- that one to long before doing the -merge-). If you end up with a data set that is organized with variables gvkey year financial_var1 financial_var2 ..., almost everything you do from that point will be easier than if you have all of a gvkey's data for all years in a single observation.
        Last edited by Clyde Schechter; 06 Apr 2023, 13:36.

        Comment

        Working...
        X