Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining variables based on condition

    Dear Statalists,

    I have a dataset with a (to me, at least) complex form of duplicates. The variables PRODUCTSEGMENT`i'SICCODE (i = 1, 2, 3, 4, 5) indicate if a firm has several different product segments by giving the SIC codes of the industries the firm operates in. The variables PRODUCTSEGMENT`i'SALES (i = 1, 2, 3, 4, 5) give the sales revenue the firm obtains from the corresponding segment.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float NUM int(PRODUCTSEGMENT1SICCODE PRODUCTSEGMENT2SICCODE PRODUCTSEGMENT3SICCODE PRODUCTSEGMENT4SICCODE PRODUCTSEGMENT5SICCODE) long(PRODUCTSEGMENT1SALES PRODUCTSEGMENT2SALES PRODUCTSEGMENT3SALES PRODUCTSEGMENT4SALES PRODUCTSEGMENT5SALES)
     1 5942 5961 5722 7375    . 1688752  661374  547190  225117      .
     2 3728 3728 3728    .    .  419313  357788   97154       .      .
     3 2834 3841 3845 2834    . 8177000 2929000 2778000 2088000      .
     4 4911 4911 4911 4911    . 2729000 2466000 2444000 1688000      .
     5 3577 5045    .    .    . 2464386  760004       .       .      .
     6 3841 8731    .    .    .   22017    2879       .       .      .
     7 5651    .    .    .    . 1364853       .       .       .      .
     8 7363    .    .    .    . 4373244       .       .       .      .
     9 3674 3674    .    .    . 3793962   97792       .       .      .
    10 2835 8731 2834    .    .    4640     700     633       .      .
    11 7372 7379 7372 7372    .  543419  291886  289624  104791      .
    12 7372    .    .    .    .   12887       .       .       .      .
    13 3523 3523 3523 3523 3523 1470300  472200  250300  195300 153400
    14 2813 2869 3569    .    . 3944000 1522800  250400       .      .
    15 3841 8731 2834    .    .   17443   14946    9859       .      .
    16 4581 4513 4522    .    . 1973000   86300   81600       .      .
    17 2297 3442 2297    .    .  698187  102273   36236       .      .
    18    .    .    .    .    .       .       .       .       .      .
    19 8399 8742 7379    .    .  272785  146826   67649       .      .
    20 3724 3724 3724 3724    . 9653000 7185000 3457000 3313000      .
    end
    However, there are duplicates, as the example below shows. Look at row 2. The same SIC code (3728) appear three times, and the corresponding sales variables have different values. Thus, I would need a code that combines the sales into one value if the same SIC code appears multiple times. For example, for row 2, I would like it to look something like this (or with new variables combining the duplicates):
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float NUM int(PRODUCTSEGMENT1SICCODE PRODUCTSEGMENT2SICCODE PRODUCTSEGMENT3SICCODE PRODUCTSEGMENT4SICCODE PRODUCTSEGMENT5SICCODE) long(PRODUCTSEGMENT1SALES PRODUCTSEGMENT2SALES PRODUCTSEGMENT3SALES PRODUCTSEGMENT4SALES PRODUCTSEGMENT5SALES)
    1 5942 5961 5722 7375 . 1688752 661374 547190 225117 .
    2 3728    .    .    . .  874255      .      .      . .
    end
    The challenge is that it might be several values that I need to combine. For example, there might be two different SIC codes that both appear multiple times for a single firm.

    Thank you in advance!

  • #2
    Welcome to Statalist and thanks for the data example. Here is one way.

    Code:
    rename PRODUCTSEGMENT*SICCODE PRODUCTSEGMENTSICCODE*
    reshape long PRODUCTSEGMENTSICCODE, i(NUM) j(which)
    bys NUM (which):gen tag= PRODUCTSEGMENTSICCODE== PRODUCTSEGMENTSICCODE[_n-1] & _n>1
    replace PRODUCTSEGMENTSICCODE=. if tag
    drop tag
    reshape wide PRODUCTSEGMENTSICCODE, i(NUM) j(which)
    rename  PRODUCTSEGMENTSICCODE* PRODUCTSEGMENT*SICCODE

    Res.:

    Code:
    . l
    
         +-------------------------------------------------------------------------------------------------------------------+
         | NUM   PRODUC..   PRODUC..   PRODUC..   PRODUC..   PRODUC..   P~1SALES   P~2SALES   P~3SALES   P~4SALES   P~5SALES |
         |-------------------------------------------------------------------------------------------------------------------|
      1. |   1       5942       5961       5722       7375          .    1688752     661374     547190     225117          . |
      2. |   2       3728          .          .          .          .     419313     357788      97154          .          . |
      3. |   3       2834       3841       3845       2834          .    8177000    2929000    2778000    2088000          . |
      4. |   4       4911          .          .          .          .    2729000    2466000    2444000    1688000          . |
      5. |   5       3577       5045          .          .          .    2464386     760004          .          .          . |
         |-------------------------------------------------------------------------------------------------------------------|
      6. |   6       3841       8731          .          .          .      22017       2879          .          .          . |
      7. |   7       5651          .          .          .          .    1364853          .          .          .          . |
      8. |   8       7363          .          .          .          .    4373244          .          .          .          . |
      9. |   9       3674          .          .          .          .    3793962      97792          .          .          . |
     10. |  10       2835       8731       2834          .          .       4640        700        633          .          . |
         |-------------------------------------------------------------------------------------------------------------------|
     11. |  11       7372       7379       7372          .          .     543419     291886     289624     104791          . |
     12. |  12       7372          .          .          .          .      12887          .          .          .          . |
     13. |  13       3523          .          .          .          .    1470300     472200     250300     195300     153400 |
     14. |  14       2813       2869       3569          .          .    3944000    1522800     250400          .          . |
     15. |  15       3841       8731       2834          .          .      17443      14946       9859          .          . |
         |-------------------------------------------------------------------------------------------------------------------|
     16. |  16       4581       4513       4522          .          .    1973000      86300      81600          .          . |
     17. |  17       2297       3442       2297          .          .     698187     102273      36236          .          . |
     18. |  18          .          .          .          .          .          .          .          .          .          . |
     19. |  19       8399       8742       7379          .          .     272785     146826      67649          .          . |
     20. |  20       3724          .          .          .          .    9653000    7185000    3457000    3313000          . |
         +-------------------------------------------------------------------------------------------------------------------+
    
    .

    To eliminate non-consecutive instances, you can sort by the SIC code

    Code:
    bys NUM (PRODUCTSEGMENTSICCODE which):gen tag= PRODUCTSEGMENTSICCODE== PRODUCTSEGMENTSICCODE[_n-1] & _n>1
    Added in Edit: I missed this

    The same SIC code (3728) appear three times, and the corresponding sales variables have different values. Thus, I would need a code that combines the sales into one value if the same SIC code appears multiple times. For example, for row 2, I would like it to look something like this (or with new variables combining the duplicates)
    Can be easily done with the data in long layout, but here is a way with the data in wide layout.

    Code:
    egen totalSIC = rownonmiss(PRODUCTSEGMENT*SICCODE)
    egen wanted=rowtotal(PRODUCTSEGMENT*SALES) if totalSIC==1
    forval i=2/5{
         replace PRODUCTSEGMENT`i'SALES=. if !missing(wanted)
     }
    replace PRODUCTSEGMENT1SALES=wanted if !missing(wanted)
    Last edited by Andrew Musau; 05 Nov 2019, 07:18.

    Comment


    • #3
      A law of Stata is

      Whenever people want to do many things rowwise, they would usually be better off after a reshape long.

      Here's that thought carried through to code.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float NUM int(PRODUCTSEGMENT1SICCODE PRODUCTSEGMENT2SICCODE PRODUCTSEGMENT3SICCODE PRODUCTSEGMENT4SICCODE PRODUCTSEGMENT5SICCODE) long(PRODUCTSEGMENT1SALES PRODUCTSEGMENT2SALES PRODUCTSEGMENT3SALES PRODUCTSEGMENT4SALES PRODUCTSEGMENT5SALES)
       1 5942 5961 5722 7375    . 1688752  661374  547190  225117      .
       2 3728 3728 3728    .    .  419313  357788   97154       .      .
       3 2834 3841 3845 2834    . 8177000 2929000 2778000 2088000      .
       4 4911 4911 4911 4911    . 2729000 2466000 2444000 1688000      .
       5 3577 5045    .    .    . 2464386  760004       .       .      .
       6 3841 8731    .    .    .   22017    2879       .       .      .
       7 5651    .    .    .    . 1364853       .       .       .      .
       8 7363    .    .    .    . 4373244       .       .       .      .
       9 3674 3674    .    .    . 3793962   97792       .       .      .
      10 2835 8731 2834    .    .    4640     700     633       .      .
      11 7372 7379 7372 7372    .  543419  291886  289624  104791      .
      12 7372    .    .    .    .   12887       .       .       .      .
      13 3523 3523 3523 3523 3523 1470300  472200  250300  195300 153400
      14 2813 2869 3569    .    . 3944000 1522800  250400       .      .
      15 3841 8731 2834    .    .   17443   14946    9859       .      .
      16 4581 4513 4522    .    . 1973000   86300   81600       .      .
      17 2297 3442 2297    .    .  698187  102273   36236       .      .
      18    .    .    .    .    .       .       .       .       .      .
      19 8399 8742 7379    .    .  272785  146826   67649       .      .
      20 3724 3724 3724 3724    . 9653000 7185000 3457000 3313000      .
      end
      
      reshape long PRODUCTSEGMENT@SICCODE PRODUCTSEGMENT@SALES, i(NUM) j(WHICH)
      
      rename (PRODUCTSEGMENT*) (*)
      
      drop if missing(SALES)
      
      list
      
          +---------------------------------+
           | NUM   WHICH   SICCODE     SALES |
           |---------------------------------|
        1. |   1       1      5942   1688752 |
        2. |   1       2      5961    661374 |
        3. |   1       3      5722    547190 |
        4. |   1       4      7375    225117 |
        5. |   2       1      3728    419313 |
           |---------------------------------|
        6. |   2       2      3728    357788 |
        7. |   2       3      3728     97154 |
        8. |   3       1      2834   8177000 |
        9. |   3       2      3841   2929000 |
       10. |   3       3      3845   2778000 |
           |---------------------------------|
       11. |   3       4      2834   2088000 |
       12. |   4       1      4911   2729000 |
       13. |   4       2      4911   2466000 |
       14. |   4       3      4911   2444000 |
       15. |   4       4      4911   1688000 |
           |---------------------------------|
       16. |   5       1      3577   2464386 |
       17. |   5       2      5045    760004 |
       18. |   6       1      3841     22017 |
       19. |   6       2      8731      2879 |
       20. |   7       1      5651   1364853 |
           |---------------------------------|
       21. |   8       1      7363   4373244 |
       22. |   9       1      3674   3793962 |
       23. |   9       2      3674     97792 |
       24. |  10       1      2835      4640 |
       25. |  10       2      8731       700 |
           |---------------------------------|
       26. |  10       3      2834       633 |
       27. |  11       1      7372    543419 |
       28. |  11       2      7379    291886 |
       29. |  11       3      7372    289624 |
       30. |  11       4      7372    104791 |
           |---------------------------------|
       31. |  12       1      7372     12887 |
       32. |  13       1      3523   1470300 |
       33. |  13       2      3523    472200 |
       34. |  13       3      3523    250300 |
       35. |  13       4      3523    195300 |
           |---------------------------------|
       36. |  13       5      3523    153400 |
       37. |  14       1      2813   3944000 |
       38. |  14       2      2869   1522800 |
       39. |  14       3      3569    250400 |
       40. |  15       1      3841     17443 |
           |---------------------------------|
       41. |  15       2      8731     14946 |
       42. |  15       3      2834      9859 |
       43. |  16       1      4581   1973000 |
       44. |  16       2      4513     86300 |
       45. |  16       3      4522     81600 |
           |---------------------------------|
       46. |  17       1      2297    698187 |
       47. |  17       2      3442    102273 |
       48. |  17       3      2297     36236 |
       49. |  19       1      8399    272785 |
       50. |  19       2      8742    146826 |
           |---------------------------------|
       51. |  19       3      7379     67649 |
       52. |  20       1      3724   9653000 |
       53. |  20       2      3724   7185000 |
       54. |  20       3      3724   3457000 |
       55. |  20       4      3724   3313000 |
           +---------------------------------+
      Now it's just a standard application of collapse.

      EDIT: Andrew Musau has the same major point. I am more dogmatic that you need a really good reason to reshape wide back again.
      Last edited by Nick Cox; 05 Nov 2019, 07:26.

      Comment


      • #4
        Works like a charm, reshaping was indeed the thing I didn't think of. Thank you for your help!

        Comment

        Working...
        X