  • #16
    How many distinct values are you trying to calculate? (Not the number of observations, but the number of categories.)


    • #17
      Dear Nick,

      Here are my variables. Patnum is a unique patent number. Permno is a firm identifier. Class is a 3-digit code; there are over 400 classes in the U.S. Patent Classification System (https://en.wikipedia.org/wiki/United...Classification), and I think most of them are covered in my database.

      dataex patnum permno year class

      ----------------------- copy starting from the next line -----------------------
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long(patnum permno) int year str3 class
      1706123 10006 1921 "251"
      1579225 10006 1922 "137"
      1699538 10006 1922 "198"
      1605442 10006 1922 "164"
      1579247 10006 1922 "072"
      1699546 10006 1923 "403"
      1748147 10006 1923 "105"
      1727684 10006 1923 "220"
      1665389 10006 1923 "403"
      1665388 10006 1923 "105"
      1876807 10006 1923 "108"
      1605415 10006 1923 "267"
      1579325 10006 1923 "072"
      1605417 10006 1923 "403"
      1579268 10006 1923 "105"
      1631340 10006 1923 "187"
      1579234 10006 1923 "074"
      1649439 10006 1924 "384"
      1748114 10006 1924 "296"
      1649395 10006 1925 "105"
      1665392 10006 1925 "052"
      2085621 10006 1925 "105"
      1760688 10006 1925 "292"
      1626654 10006 1925 "105"
      1665368 10006 1925 "220"
      1649431 10006 1925 "005"
      1665407 10006 1925 "180"
      1626653 10006 1925 "105"
      1685132 10006 1925 "220"
      1616582 10006 1925 "105"
      1631309 10006 1925 "454"
      1665391 10006 1925 "137"
      1579214 10006 1925 "148"
      1685111 10006 1925 "220"
      1631313 10006 1926 "105"
      1631314 10006 1926 "105"
      1685126 10006 1926 "411"
      1605410 10006 1926 "295"
      1727638 10006 1926 "105"
      1649434 10006 1926 "188"
      end

      Edit: counting the distinct classes gives 430:

      by class, sort: gen nvals = _n == 1
      count if nvals
      430



      Best regards, Farid
      Last edited by Farid Mammadaliyev; 09 Jan 2019, 08:27.


      • #18
        OK, but what was your entropyetc call?


        • #19
          I used this command: entropyetc class, by(permno year).

          I want the HHI for each firm in a given year.
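
For reference, a sketch of the quantity being asked for, assuming HHI here means the usual sum of squared shares (the same quantity as the "Simpson" measure in the entropyetc_ code posted later in this thread). The symbols f (firm), t (year), c (class), n (patent count) and J (number of classes) are notation introduced only for this illustration:

```latex
\mathrm{HHI}_{ft} = \sum_{c=1}^{J} p_{ftc}^{2},
\qquad
p_{ftc} = \frac{n_{ftc}}{\sum_{k=1}^{J} n_{ftk}}
```

For example, a firm-year with four patents, three in one class and one in another, would have HHI = 0.75^2 + 0.25^2 = 0.625.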


          • #20
            So, please do this

            Code:
            egen test = group(permno year) 
            
            su test, meanonly
            
            di r(max)
            and compare with what

            Code:
            help limits
            tells you is the limit for tabulate in your Stata.

            entropyetc internally uses tabulate, and that could be what is failing if you want thousands of results.


            • #21
              Dear Nick,

              egen test = group(permno year)

              su test, meanonly

              di r(max)
              75101


              The tabulate limit for a one-way table is 12,000 rows. Does that mean I cannot measure HHI for my class variable? Or is there a way to increase the limit of tabulate?

              Best regards, Farid


              • #22
                Well, you could get a job with StataCorp and edit the tabulate source code, but I don't think the developers work like that.

                Or you could rewrite entropyetc.ado.

                Or hope that I do that, or someone else. Which do you want?

                PS: So you want to use 75000 or so inequality measures.... I believe you, but it's a record for what I've seen.


                • #23
                  Depending on the number of observations within each group, you could probably just loop over the groups:

                  Code:
                  forvalues i = 1/75101 {
                      entropyetc ... if (test==`i') 
                  }
                  However, I wonder how you would want to look at 75,000 measures.

                  Best
                  Daniel


                  • #24
                    daniel klein Unfortunately it's not quite as easy as that in general. If all categories are represented in all groups, then that would work. Some measures calculated require explicit zeros for categories not represented in any group to be comparable across groups, and looking at each group independently can't ensure that such zeros are included.

                    But for HHI that should work.
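
A toy illustration of this point, as a sketch only: it assumes HHI is the sum of squared shares and that the dissimilarity index is half the sum of |p_j - 1/J|, as in the entropyetc_ code posted in this thread. The shares and scalar names are made up for the example.

```stata
* One group whose observations fall equally into 2 categories;
* a third category exists in the data but is absent from this group
scalar p1 = 0.5
scalar p2 = 0.5
scalar p3 = 0

* HHI: the explicit zero adds nothing, so J = 2 and J = 3 agree
scalar hhi_2 = p1^2 + p2^2
scalar hhi_3 = p1^2 + p2^2 + p3^2

* Dissimilarity from equal shares: the result depends on J
scalar dis_2 = (abs(p1 - 1/2) + abs(p2 - 1/2)) / 2
scalar dis_3 = (abs(p1 - 1/3) + abs(p2 - 1/3) + abs(p3 - 1/3)) / 2

display "HHI:     J = 2: " hhi_2 "   J = 3: " hhi_3
display "dissim.: J = 2: " dis_2 "   J = 3: " dis_3
```

Under these assumptions the two HHI values coincide, while the dissimilarity index changes once the absent category is counted, which is why group-by-group calculation is safe for HHI but not for every measure.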


                    • #25
                      Nick, thanks for the warning. I have not looked closely at the formulas but just ran a couple of toy examples in which separate HHI were equivalent to combined ones.

                      Best
                      Daniel


                      • #26
                        Farid Mammadaliyev daniel klein

                        This is a quick rewrite of entropyetc intended for larger datasets, called entropyetc_. It is for cases where you want many thousands of results (in new variables, presumably). The tabulation will fail if there are too many categories, but (guessing again) you wouldn't really want it in that circumstance. There is no separate help file and it doesn't try to be a clone of entropyetc: for example, matrix output is not supported.

                        Code:
                        *! 1.0.0 NJC 9 January 2019
                        *! entropyetc 2.0.0 NJC 5 July 2018
                        *! entropyetc 1.0.0 NJC 20 November 2016
                        program entropyetc_, rclass    
                                version 11.2
                                syntax varname [if] [in] [aweight fweight] [, by(varlist) Generate(str) Format(str) * ]
                        
                            quietly {
                                marksample touse, strok  
                                if "`by'" != "" markout `touse' `by', strok
                                count if `touse'
                                if r(N) == 0 error 2000
                        
                                if "`generate'" != "" parsegenerate `generate'
                                
                                tempvar group Shannon Simpson Shannon2 Simpson2 dissim categ total
                                tempname recJ mylbl
                        
                                if "`by'" != "" {
                                    egen long `group' = group(`by') if `touse', label
                                    compress `group'
                                    su `group', meanonly  
                                    local ng = r(max)
                                }    
                                else {
                                    gen byte `group' = `touse'
                                    local ng = 1
                                    label define `group' 1 "all"
                                    label val `group' `group'
                                }
                                    
                                foreach s in Shannon Simpson Shannon2 Simpson2 dissim {
                                    gen ``s'' = 0 if `touse'
                                }
                        
                                label var `Shannon'  "Shannon H"
                                label var `Shannon2' "exp(H)"
                                label var `Simpson'  "Simpson"
                                label var `Simpson2' "1/Simpson"
                                label var `dissim'   "dissim."
                        
                                egen long `categ' = group(`varlist')
                                compress `categ'
                                su `categ', meanonly
                                local J = r(max)
                                scalar `recJ' = 1/`J'
                                if "`exp'" == "" local exp 1
                        
                                gen `total' = 0
                                forval j = 1/`J' {
                                    tempvar p`j'
                                    // parenthesize the comparison: * binds more tightly than ==
                                    bysort `group' : gen `p`j'' = sum(`exp' * (`categ' == `j'))
                                    by `group' : replace `p`j'' = `p`j''[_N]
                                    replace `total' = `total' + `p`j''
                                }
                        
                                forval j = 1/`J' {
                                    replace `p`j'' = `p`j'' / `total'
                                    replace `Shannon' = `Shannon' + max(0, -`p`j'' * ln(`p`j''))
                                    replace `Simpson' = `Simpson' + `p`j''^2
                                    replace `dissim' = `dissim' + abs(`p`j'' - `recJ')
                                }
                        
                                replace `Simpson2' = 1/`Simpson'
                                replace `Shannon2' = exp(`Shannon')  
                                replace `dissim' = `dissim'/2
                                
                                return scalar categories = `J'
                        
                                label var `group' "Group"
                                if "`format'" == "" local format "%4.3f"
                            }    
                                    
                            quietly if "`generate'" != "" {
                                local lbl1 "Shannon H"
                                local lbl2 "exp(H)"
                                local lbl3 "Simpson"
                                local lbl4 "1/Simpson"
                                local lbl5 "dissimilarity index"
                        
                                tokenize `Shannon' `Shannon2' `Simpson' `Simpson2' `dissim'  
                                forval j = 1/5 {
                                    if "`var_`j''" != "" {
                                        gen `var_`j'' = ``j''
                                        label var `var_`j'' "`lbl`j''"
                                    }
                                }
                            }
                        
                            capture noisily tabdisp `group' if `touse', ///
                                c(`Shannon' `Shannon2' `Simpson' `Simpson2' `dissim') ///
                                format(`format') `options'
                        end
                        
                        program parsegenerate
                            tokenize `0'
                            if "`6'" != "" {
                                di as err "generate() should specify 1 to 5 tokens"
                                exit 134
                            }
                        
                            forval j = 1/5 {
                                if "``j''" != "" {
                                    gettoken no rest : `j', parse(=)  
                                    capture numlist "`no'", max(1) int range(>=1 <=5)
                                    if _rc {
                                        di as err "generate() error: ``j''"
                                        exit _rc
                                    }
                        
                                    gettoken eqs rest : rest, parse(=)
                                    confirm new var `rest'
                                    c_local var_`no' "`rest'"
                                }
                            }  
                        end
                        Here is a test script:

                        Code:
                        set rmsg on
                        sysuse auto, clear
                        entropyetc rep78
                        entropyetc rep78, by(foreign)
                        
                        webuse nlsw88, clear
                        entropyetc occupation, by(industry) gen(2=numeq)
                        egen tag = tag(industry)
                        graph dot (asis) numeq if tag, over(industry, sort(1) descending) linetype(line)
                        
                        sysuse auto, clear
                        entropyetc_ rep78
                        entropyetc_ rep78, by(foreign)
                        
                        webuse nlsw88, clear
                        entropyetc_ occupation, by(industry) gen(2=numeq)
                        egen tag = tag(industry)
                        graph dot (asis) numeq if tag, over(industry, sort(1) descending) linetype(line)


                        • #27
                          Dear daniel klein and Nick Cox,

                          I used this code:

                          sort permno year

                          egen tag = tag(permno year class)

                          egen distinct = total(tag), by(permno year)

                          bys permno year: gen sum = (tag/distinct)^2

                          bys permno year: egen HHI = total(sum)

                          Dataex:

                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long permno int year byte tag float(distinct sum HHI)
                          10006 1921 1 1 1 1
                          10006 1922 1 4 .0625 .25
                          10006 1922 1 4 .0625 .25
                          10006 1922 1 4 .0625 .25
                          10006 1922 1 4 .0625 .25
                          10006 1923 1 8 .015625 .125
                          10006 1923 1 8 .015625 .125
                          10006 1923 1 8 .015625 .125
                          10006 1923 0 8 0 .125
                          10006 1923 0 8 0 .125
                          10006 1923 1 8 .015625 .125
                          10006 1923 1 8 .015625 .125
                          10006 1923 1 8 .015625 .125
                          10006 1923 0 8 0 .125
                          10006 1923 0 8 0 .125
                          10006 1923 1 8 .015625 .125
                          10006 1923 1 8 .015625 .125
                          10006 1924 1 2 .25 .5
                          10006 1924 1 2 .25 .5
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 0 9 0 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 0 9 0 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 0 9 0 .11111111
                          10006 1925 0 9 0 .11111111
                          10006 1925 0 9 0 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 1 9 .01234568 .11111111
                          10006 1925 0 9 0 .11111111
                          10006 1926 1 7 .020408163 .14285713
                          10006 1926 0 7 0 .14285713
                          10006 1926 1 7 .020408163 .14285713
                          10006 1926 1 7 .020408163 .14285713
                          10006 1926 0 7 0 .14285713
                          10006 1926 1 7 .020408163 .14285713
                          10006 1926 1 7 .020408163 .14285713
                          10006 1926 0 7 0 .14285713
                          10006 1926 0 7 0 .14285713
                          10006 1926 0 7 0 .14285713
                          10006 1926 0 7 0 .14285713
                          10006 1926 0 7 0 .14285713
                          10006 1926 1 7 .020408163 .14285713
                          10006 1926 1 7 .020408163 .14285713
                          end


                          Here the problem is that the tab variable does not report how many times a distinct value appears in the group (firm-year); therefore, the HHI is not correct.

                          Nick, I think this is the warning you gave about Daniel's solution, right?

                          Best regards, Farid
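
One way to bring the within-group frequencies into the calculation, as a minimal sketch: it assumes HHI is the sum of squared class shares within each firm-year. The toy data and the variable names n_class, n_total, first and sqshare are hypothetical and chosen only for illustration.

```stata
* Toy data: one firm, two years (made-up values)
clear
input long permno int year str3 class
10006 1922 "137"
10006 1922 "198"
10006 1922 "198"
10006 1922 "198"
10006 1923 "403"
10006 1923 "105"
end

* Number of patents per firm-year-class and per firm-year
bysort permno year class: gen long n_class = _N
bysort permno year: gen long n_total = _N

* Flag one observation per firm-year-class so each class counts once
egen byte first = tag(permno year class)

* Squared share of each class, summed once per class within firm-year
gen double sqshare = first * (n_class / n_total)^2
egen double HHI = total(sqshare), by(permno year)

list permno year class n_class n_total HHI, sepby(permno year)
```

In this toy example the 1922 shares are 1/4 and 3/4, so the HHI is 0.0625 + 0.5625 = 0.625, and the 1923 shares are 1/2 each, so the HHI is 0.5. Dividing the 0/1 tag by the number of distinct classes, by contrast, treats every class as equally frequent and always yields 1/distinct.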


                          • #28
                            I can't follow #27. What tab does/would do is not obviously pertinent: it doesn't feature in your code.


                            • #29
                              Sorry, Nick Cox, that should be "tag" instead of "tab", because the name of the variable I created is "tag".


                              • #30
                                OK, but what is your question? You ask something about Daniel's solution, but I can't follow what you're asking. (Neither can he, it seems.)

                                Did you notice #26? It should work for your situation except that the final tabulation will fail. As said, I doubt that you want to scan a table with many thousand rows in any case.
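
A sketch of what such a call might look like for the firm-year case. It assumes the entropyetc_ program from #26 has been defined in the current session (for example, saved as entropyetc_.ado on the adopath) and that the patent data with class, permno and year are in memory. In that code the third slot of generate() corresponds to the Simpson measure, which is the sum of squared shares; the names HHI and firstobs are chosen here for illustration.

```stata
* Assumes entropyetc_ from #26 is defined and the patent data are loaded.
* Slot 3 of generate() is the Simpson measure (sum of squared shares).
entropyetc_ class, by(permno year) generate(3=HHI)

* Flag one observation per firm-year and summarize the new variable
egen byte firstobs = tag(permno year)
summarize HHI if firstobs
```

Because the final tabdisp in that program is wrapped in capture noisily, an error message from the table step should not stop the new variable from being created. With several hundred classes the program creates one temporary variable per class, so run time and memory use may be noticeable.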

                                Comment

                                Working...
                                X