Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Aggregating observations for different variables based on an ID

    Hi there,

    I have a dataset on ownership concetration for various firms with roughly 490000 observations . As i am interested in ownership concentrations for each particular firm, the problem i have with my dataset is that, ownership concentrations are not given collectively for the firm but rather independently for each individual within a firm. Hence, my questions is how i can aggregate the observations for the variables sh_direct1 and sh_total1 to one observations for each firm? Below you can find an example of my data

    input str16 idnr double idnr1 str2 sh_type str6(sh_direct sh_total) long sh_date str24 repbas_header long sh_type1 float(sh_type2 firmid) long(sh_direct1 sh_total1) float sh_direct1temp
    "DKFO3" 3 "I" ">5.00" "" 21598 "Consolidated data" 8 8 1 8877 . 26631
    "DKFO3" 3 "I" ">5.00" "" 21598 "Consolidated data" 8 8 1 8877 . 26631
    "DKFO3" 3 "I" ">5.00" "" 21598 "Consolidated data" 8 8 1 8877 . 26631
    "RO19" 19 "I" "-" "" 21579 "Unconsolidated data" 8 8 2 1 . 809
    "RO19" 19 "I" "16.05" "" 21365 "Unconsolidated data" 8 8 2 808 . 809
    "GBSE000027" 27 "I" "10.27" "" 21419 "Unconsolidated data" 8 8 3 229 . 21159
    "GBSE000027" 27 "I" "9.82" "" 21419 "Unconsolidated data" 8 8 3 8015 . 21159
    "GBSE000027" 27 "I" "9.82" "" 21419 "Unconsolidated data" 8 8 3 8015 . 21159
    "GBSE000027" 27 "I" "10.27" "" 21419 "Unconsolidated data" 8 8 3 229 . 21159
    "GBSE000027" 27 "I" "19.64" "" 21419 "Unconsolidated data" 8 8 3 1167 . 21159
    "GBSE000027" 27 "I" "40.18" "" 21419 "Unconsolidated data" 8 8 3 3504 . 21159
    "MTC35" 35 "I" "2.23" "" 20939 "Unconsolidated data" 8 8 4 1226 . 8459
    "MTC35" 35 "I" "0.48" "" 20939 "Unconsolidated data" 8 8 4 50 . 8459
    "RO35" 35 "I" "23.71" "" 21579 "Unconsolidated data" 8 8 4 1674 . 8459
    "MTC35" 35 "I" "6.83" "" 20939 "Unconsolidated data" 8 8 4 5509 . 8459
    "GBJE67" 67 "I" "NG" "" 21166 "Consolidated data" 8 8 5 8892 . 44460
    "GBJE67" 67 "I" "NG" "" 21166 "Consolidated data" 8 8 5 8892 . 44460
    "GBJE67" 67 "I" "NG" "" 21166 "Consolidated data" 8 8 5 8892 . 44460
    "GBJE67" 67 "I" "NG" "" 21166 "Consolidated data" 8 8 5 8892 . 44460
    "GBJE67" 67 "I" "NG" "" 21166 "Consolidated data" 8 8 5 8892 . 44460
    "MTC68" 68 "I" "0.36" "" 21605 "Unconsolidated data" 8 8 6 38 . 76
    "MTC68" 68 "I" "0.36" "" 21605 "Unconsolidated data" 8 8 6 38 . 76
    "GBSE000075" 75 "I" "19.11" "" 21317 "Consolidated data" 8 8 7 1114 . 15307
    "GBSE000075" 75 "I" "20.25" "" 21317 "Consolidated data" 8 8 7 1328 . 15307
    "GBSE000075" 75 "I" "NG" "" 21317 "Consolidated data" 8 8 7 8892 . 15307
    "GBSE000075" 75 "I" "20.21" "" 21317 "Consolidated data" 8 8 7 1324 . 15307
    "GBSE000075" 75 "I" "20.18" "" 21317 "Consolidated data" 8 8 7 1321 . 15307
    "GBSE000075" 75 "I" "20.24" "" 21317 "Consolidated data" 8 8 7 1327 . 15307
    "GBSE000075" 75 "I" "-" "MO" 21564 "Consolidated data" 8 8 7 1 1128 15307
    "GBSE000078" 78 "I" "-" "3.93" 21586 "Consolidated data" 8 8 8 1 604 4782
    "GBSE000078" 78 "I" "-" "10.22" 21184 "Consolidated data" 8 8 8 1 175 4782
    "GBSE000078" 78 "I" "0.01" "" 21184 "Consolidated data" 8 8 8 3 . 4782
    "GBSE000078" 78 "I" "4.82" "" 21264 "Consolidated data" 8 8 8 3468 . 4782
    "GBSE000078" 78 "I" "0.02" "" 21184 "Consolidated data" 8 8 8 4 . 4782
    "GBSE000078" 78 "I" "-" "1.05" 21457 "Consolidated data" 8 8 8 1 100 4782
    "GBSE000078" 78 "I" "0.01" "" 21184 "Consolidated data" 8 8 8 3 . 4782

    For example for the firm MTC35 i would like to have just one observation with the total amount of 2.71.

    I have already grouped the firms by giving them their individual firm IDs, and used the following 2 codes trying to summarize the observations for each individual observation.
    Code:
    egen sh_direct1temp = sum(sh_direct1), by(firmid)
    collapse (sum) sh_direct1, by (firmid)
    However as can be seen above the summation doesn't add up, not sure if this is due to the format as sh_direct1 and sh_total1 are supposed to be percentages of ownership concentration.

    Thus to re-emphasise i would like have for each individual firm one observation with the total sh_direct1 and sh_total1 (even though in this case this firm doesn't have a percentage for sh_total1) instead of three like is the case with DKF03 or firmid 1 in this case.

    Any help would be greatly appreciated, Thank you!

    Regards,

    Shariff

  • #2
    hi Shariff, what do you mean by saying "above the summation doesn't add up"? I can clearly see it does add up. I don't think your problem is a difficult one if you can put it more clearly.
    2B or not 2B, that's a question!

    Comment


    • #3
      well, there are actually two questions I cannot answer although I know what they are now. First is what you are interested in is a string variable (including ">5")that cannot be added up. Second, what you added up in your code is rather another variable sh_direct1 (not sh_direct). You cannot expect the sum to be same.
      2B or not 2B, that's a question!

      Comment


      • #4
        I can make a good guess at what has happened here.

        The story is tedious, so giving away the punchline at the outset is not much of a spoiler. (For what really happens in Game of Thrones series 8, episodes 5 and 6, watch out for my next post.)

        Somebody used encode. They should have used destring.

        To understand my guess let's focus on what is important, as some of the variables here have no bearing on the question. I use groups (Stata Journal) but all you need to know there is that this is just a selective listing.

        Code:
        . groups firmid idnr *direct* , show(none) sepby(idnr)
        
          +------------------------------------------------------+
          | firmid         idnr   sh_dir~t   sh_dir~1   sh_dir~p |
          |------------------------------------------------------|
          |      1        DKFO3      >5.00       8877      26631 |
          |------------------------------------------------------|
          |      2         RO19          -          1        809 |
          |      2         RO19      16.05        808        809 |
          |------------------------------------------------------|
          |      3   GBSE000027      10.27        229      21159 |
          |      3   GBSE000027      19.64       1167      21159 |
          |      3   GBSE000027      40.18       3504      21159 |
          |      3   GBSE000027       9.82       8015      21159 |
          |------------------------------------------------------|
          |      4        MTC35       0.48         50       8459 |
          |      4        MTC35       2.23       1226       8459 |
          |      4        MTC35       6.83       5509       8459 |
          |------------------------------------------------------|
          |      4         RO35      23.71       1674       8459 |
          |------------------------------------------------------|
          |      5       GBJE67         NG       8892      44460 |
          |------------------------------------------------------|
          |      6        MTC68       0.36         38         76 |
          |------------------------------------------------------|
          |      7   GBSE000075          -          1      15307 |
          |      7   GBSE000075      19.11       1114      15307 |
          |      7   GBSE000075      20.18       1321      15307 |
          |      7   GBSE000075      20.21       1324      15307 |
          |      7   GBSE000075      20.24       1327      15307 |
          |      7   GBSE000075      20.25       1328      15307 |
          |      7   GBSE000075         NG       8892      15307 |
          |------------------------------------------------------|
          |      8   GBSE000078          -          1       4782 |
          |      8   GBSE000078       0.01          3       4782 |
          |      8   GBSE000078       0.02          4       4782 |
          |      8   GBSE000078       4.82       3468       4782 |
          +------------------------------------------------------+
        sh_direct1 is a string variable and that is easily explained. Characters like > just can't fit into a numeric variable. At input -- perhaps on copy and paste from a spreadsheet, or something similar -- they could not be imported as numeric.

        So, what has been done? Shariff or someone else earlier has used encode to produce a numeric variable.

        That has produced nonsense.

        The distinct string values have just been mapped to integers 1 up without taking any note of their numeric content except what is implied by their dictionary order.

        This is even easier to see when you look at the mapping. Evidently this is a sample from a much larger set, but what encode does (in the absence of other instructions) is to sort the distinct string values and work out a mapping to integers 1 up. The mapping doesn't even preserve the numeric ordering, but we have here a robot, not a program that can possibly know that you don't mean what you say.

        Code:
        . groups sh_direct sh_direct1, show(none) sep(0)
        
          +---------------------+
          | sh_dir~t   sh_dir~1 |
          |---------------------|
          |        -          1 |
          |     0.01          3 |
          |     0.02          4 |
          |     0.36         38 |
          |     0.48         50 |
          |    10.27        229 |
          |    16.05        808 |
          |    19.11       1114 |
          |    19.64       1167 |
          |     2.23       1226 |
          |    20.18       1321 |
          |    20.21       1324 |
          |    20.24       1327 |
          |    20.25       1328 |
          |    23.71       1674 |
          |     4.82       3468 |
          |    40.18       3504 |
          |     6.83       5509 |
          |     9.82       8015 |
          |    >5.00       8877 |
          |       NG       8892 |
          +---------------------+
        So, you should use destring. You will have to do something about inequalities and perhaps other stuff.

        A paper by Clyde Schechter and myself at https://journals.sagepub.com/doi/abs...867X1801800413 spells this out, and much else besides.

        Comment

        Working...
        X