Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Question: Industry dummy for 6 most frequent industries

    Hi,

    As part of my master's progam in Accountancy, I am doing a replication study. I was wondering if anyone would be able to give me some advice on the following topic.

    One of the variables that I have to create from the data is an industry dummy. This dummy should have the value "1" if the firm-year has a industry code (sic2) which is among the 6 most frequent industry codes in the dataset, and "0" otherwise.

    My data:
    Code:
    * Example generated by -dataex-. 
    clear
    input str10 tic long gvkey2 double fyear float sic2
    "PNW"   1075 2013 49
    "PNW"   1075 2014 49
    "PNW"   1075 2015 49
    "ABT"   1078 2013 38
    "ABT"   1078 2014 38
    "ABT"   1078 2015 38
    "AET"   1177 2013 63
    "HON"   1300 2013 99
    "HON"   1300 2014 99
    "HON"   1300 2015 99
    "AEP"   1440 2013 49
    "AEP"   1440 2014 49
    "AEP"   1440 2015 49
    end
    Currently, I am using the following commands to create the industry dummy:
    Code:
    groups sic2, order(h) select(6)
    /// From the table can be obtained that the six most frequent sic codes are 49, 28, 73, 60, 38 and 63.
    gen D_IND = 0
    foreach i in 28 38 49 60 63 73  {
        replace D_IND = 1 if sic2 == `i'
    }
    However, ideally, I would like to use the output from "groups" automatically in my loop. So, I don't have to type the relevant sic2 codes in the loop myself (as this seems inefficient and has to been changed manually if my data changes).
    Is there a way to do this?

    For your information, I am using Stata 16.1.
    I hope this gives you enough information about my question. Please let me know if you can help me out!

    Kind regards,
    Lianne

  • #2
    Welcome to Statalist and thanks for following the question guideline.

    Code:
    * Generate a sequence in case that matters:
    gen seq = _n
    
    bysort sic2: egen totalcount = count(sic2)
    bysort sic2: replace totalcount = . if _n != 1
    gsort -totalcount
    
    * I used 3 here due to small data, you can change that to 6
    gen temp_rank = (_n <= 3)
    * You can also do more sophisticated thing like -egen rank-. With tied rank
    * this process above can be tricky. Visual check after -gsort- above recommended.
    * Or, use -list totalcount in 1/20- to print that onto the output or log.
    
    egen top_x = max(temp_rank), by(sic2)
    
    gsort seq
    drop totalcount temp_rank seq
    list, sep(0)
    Results:
    Code:
         +-------------------------------------+
         | tic   gvkey2   fyear   sic2   top_x |
         |-------------------------------------|
      1. | PNW     1075    2013     49       1 |
      2. | PNW     1075    2014     49       1 |
      3. | PNW     1075    2015     49       1 |
      4. | ABT     1078    2013     38       1 |
      5. | ABT     1078    2014     38       1 |
      6. | ABT     1078    2015     38       1 |
      7. | AET     1177    2013     63       0 |
      8. | HON     1300    2013     99       1 |
      9. | HON     1300    2014     99       1 |
     10. | HON     1300    2015     99       1 |
     11. | AEP     1440    2013     49       1 |
     12. | AEP     1440    2014     49       1 |
     13. | AEP     1440    2015     49       1 |
         +-------------------------------------+
    Last edited by Ken Chui; 18 Jun 2021, 07:21.

    Comment


    • #3
      Hi Ken,

      Thank you! For both the answer to my question and your fast reply.
      This will help me a lot, so thanks again!


      Best,
      Lianne

      Comment

      Working...
      X