
  • entropyetc available from SSC

    With thanks as always to Kit Baum, I flag that a new program entropyetc is available from SSC for Stata 11.2 up.

    The name entropyetc should be parsed "entropy, etc." and flags that it calculates Shannon entropy as one of a bundle of loosely related measures of diversity (concentration, inequality, heterogeneity, impurity, ...: the list of near synonyms in many literatures goes on and on).

    There are now many user-written Stata programs in what appears to be the same territory, but in fact most seem written with income inequality in mind, so that the data arrive as incomes for groups or individuals and that variable is treated as it comes. Such programs usually carry across to any additive variable.

    In contrast entropyetc is one of a smaller group of programs with main focus diversity (same comment) for categorical variables.

    The difference is sometimes slurred over. If an input variable is categorical, it can't be added usefully (or meaningfully). For a recent thread raising this point see http://www.statalist.org/forums/foru...milarity-index

    What can be added are frequencies, or more generally abundances, of categories.

    A near equivalent to entropyetc is divcat from Dirk Enzmann, also on SSC, announced at http://www.statalist.org/forums/foru...ailable-on-ssc

    The nearest equivalent is, however, my own ineq from 1998 (also on SSC). ineq assumes that the data are already summarized in terms of frequencies or other measures of abundance. Besides that, between 1998 and 2016 Stata has shifted and different coding is now both possible and natural. Rather than rewriting ineq drastically, it seemed better on reflection to leave the older program untouched. (Despite comments elsewhere, I am mindful of the need for old programs to remain accessible to the extent that that may be useful.)

    I've tried to write entropyetc so that it is easy to clone and to modify (which doesn't mean: to plagiarize!). One detail that is in fact central to the design is that most of these measures boil down to about one line of Mata: the main contribution of a program is to make it easy, or at least easier, to collate results from several different groups. I hope to write further on that in due course.
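To make the "about one line of Mata" point concrete, here is a minimal sketch, not taken from the entropyetc source. It assumes the category frequencies come from tabulate with matcell(); the names freq, H, numeq and simpson are illustrative only.

```stata
* Sketch only: freq, H, numeq and simpson are illustrative names.
sysuse auto, clear
quietly tabulate rep78, matcell(freq)            // category counts as an r x 1 matrix

mata:
p = st_matrix("freq")
p = p :/ sum(p)                                  // counts -> proportions
st_numscalar("H", -sum(p :* ln(p)))              // Shannon H
st_numscalar("numeq", exp(-sum(p :* ln(p))))     // exp(H), the numbers equivalent
st_numscalar("simpson", sum(p :^ 2))             // Simpson
end

display %6.3f scalar(H) %8.3f scalar(numeq) %8.3f scalar(simpson) %8.3f 1/scalar(simpson)
```

Each measure is a single expression in the proportions p. What a full program adds is mostly the bookkeeping for groups, weights and display.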

    Here are a couple of examples. First, we treat rep78 from the auto dataset as a categorical variable. Second, we look at diversity of occupations within industries in the nlsw88 dataset and underline that results can be put into variables and become data for later analysis.

    Code:
    . sysuse auto
    (1978 Automobile Data)
    
    . entropyetc rep78
    
    ----------------------------------------------------------------------
        Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
    ----------+-----------------------------------------------------------
          all |      1.358       3.888       0.297       3.369       0.296
    ----------------------------------------------------------------------
    
    . entropyetc rep78, by(foreign)
    
    ----------------------------------------------------------------------
        Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
    ----------+-----------------------------------------------------------
     Domestic |      1.201       3.323       0.383       2.612       0.363
      Foreign |      1.004       2.730       0.388       2.579       0.457
    ----------------------------------------------------------------------
    
    . webuse nlsw88
    (NLSW, 1988 extract)
    
    . entropyetc occupation, by(industry) gen(2=numeq)
    
    ------------------------------------------------------------------------------------
                      Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
    ------------------------+-----------------------------------------------------------
      Ag/Forestry/Fisheries |      1.646       5.186       0.239       4.188       0.534
                     Mining |      0.562       1.755       0.625       1.600       0.846
               Construction |      1.399       4.050       0.353       2.832       0.597
              Manufacturing |      1.470       4.348       0.316       3.167       0.575
     Transport/Comm/Utility |      1.484       4.411       0.342       2.922       0.556
     Wholesale/Retail Trade |      1.740       5.698       0.214       4.681       0.554
    Finance/Ins/Real Estate |      1.206       3.340       0.355       2.818       0.707
        Business/Repair Svc |      1.579       4.849       0.277       3.608       0.588
          Personal Services |      1.597       4.937       0.243       4.107       0.599
      Entertainment/Rec Svc |      1.712       5.538       0.218       4.587       0.516
      Professional Services |      1.590       4.902       0.219       4.558       0.612
      Public Administration |      1.195       3.304       0.404       2.473       0.701
    ------------------------------------------------------------------------------------
    
    . egen tag = tag(industry)
    
    . graph dot (asis) numeq if tag, over(industry, sort(1) descending) linetype(line)


    [Figure: entropyetc.png, the dot chart of numeq by industry produced by the graph dot command above]


  • #2
    Thanks to Nick for pulling these together.

    As it happens, I've been working with these measures recently and noted, as has Nick, that various user-written programs for them exist. One additional thing that I would note is that for any measure that is a differentiable function of the multinomial p[i]s (e.g., the Simpson index) a relatively simple Delta Method standard error estimate can be calculated, of the form:

    se(M) = sqrt(dM * S * dM'), where

    M is the measure of interest, dM is a row vector in which dM[i] is the derivative of M with respect to p[i],
    and S is the variance-covariance matrix of the multinomial p[i], with S[i,j] = -p[i] * p[j]/N for i != j and S[i,i] = p[i] * (1 - p[i])/N


    This approximation is easy to program, and, in my experience, works surprisingly well as compared to bootstrap results even with modest sized (say N = 150) samples. (There are cases, though, in which the approximation fails badly, e.g., for the Simpson index when the distribution is nearly uniform.)
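As an illustration of how little code this takes, here is a minimal Mata sketch for the Simpson index (sum of p[i]^2, so dM[i] = 2 * p[i]). It assumes the usual multinomial covariance, with p[i] * (1 - p[i])/N on the diagonal of S and -p[i] * p[j]/N off it, and frequencies taken from tabulate; freq and se_simpson are illustrative names.

```stata
* Sketch only: freq, simpson and se_simpson are illustrative names.
sysuse auto, clear
quietly tabulate rep78, matcell(freq)

mata:
f  = st_matrix("freq")              // r x 1 column of counts
N  = sum(f)
p  = f :/ N
S  = (diag(p) - p * p') / N         // multinomial covariance of the p[i]
dM = 2 * p'                         // row vector of derivatives of sum p[i]^2
st_numscalar("simpson", sum(p :^ 2))
st_numscalar("se_simpson", sqrt(dM * S * dM'))
end

display "Simpson = " %6.3f scalar(simpson) "   delta-method se = " %6.4f scalar(se_simpson)
```

For this measure the quadratic form reduces to (4/N) * (sum p[i]^3 - (sum p[i]^2)^2), which is exactly zero when all p[i] are equal. That is one way to see why the approximation breaks down near a uniform distribution.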





    • #3
      Mike: Thanks for that. My program doesn't support any kind of error calculation. I'd imagine bootstrapping as an alternative, but someone would perhaps best be advised to write a wrapper that returns the specific quantity of interest.
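A minimal sketch of such a wrapper, assuming Shannon H is the quantity of interest. The program name shannon_boot and the result r(H) are hypothetical and not part of entropyetc; the calculation is redone from tabulate rather than taken from entropyetc's own results.

```stata
* Sketch only: shannon_boot and r(H) are hypothetical names, not part of entropyetc.
capture program drop shannon_boot
program shannon_boot, rclass
    version 11.2
    syntax varname(numeric) [if] [in]
    marksample touse                       // excludes missing values of the variable
    tempname f h
    quietly tabulate `varlist' if `touse', matcell(`f')
    mata: p = st_matrix("`f'")
    mata: p = p :/ sum(p)
    mata: st_numscalar("`h'", -sum(p :* ln(p)))
    return scalar H = scalar(`h')
end

sysuse auto, clear
shannon_boot rep78
display %6.3f r(H)

* reps() and seed() values here are arbitrary choices
bootstrap H = r(H), reps(200) seed(2803): shannon_boot rep78
```

Returning a single scalar keeps the bootstrap call simple. Other measures could be returned the same way with further return scalar lines.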



      • #4
        entropyetc is now updated on SSC, thanks to Kit Baum. The main change is that I was a little dissatisfied with the internals, although users would have to work very hard to see any difference in the results. But this also makes public a fix made some time ago in my private files: previously entropyetc would fail with a string variable fed to by(). I noticed that for myself some time ago but River Huang recently flagged the difficulty, making a public update a good idea.



        • #5
          Dear Nick, Many thanks for the updates.

          Ho-Chuan (River) Huang
          Stata 17.0, MP(4)



          • #6
            Dear Nick, Thank you for your updates. I have estimated entropy using entropyetc. However, my data are time series, so I need to check the stability of the different entropy measures from the above code (particularly Shannon H) over time in a rolling-window framework, with an increment of 1 period between successive windows. In a second step, I want to plot the entropy value on the y-axis against the time period on the x-axis. Could you please help in writing the code? For data, you may use sysuse tsline2.dta

            Thank you



            • #7
              I can't see your code to help you with it.

              Here's some strategic help. Use rangestat (SSC) for the rolling part and take Mata code from entropyetc and marry the two.

              Good luck!



              • #8
                Dear Nick, Thank you for your help. Here is the code I have used:
                webuse lutkepohl2
                tsset qtr
                rolling, window(10): entropyetc dln_inv

                However, Stata reports: too many values
                an error occurred when rolling executed entropyetc
                r(134);


                Similarly, I have used other examples you have suggested, as follows:
                webuse grunfeld, clear
                rangestat (entropyetc) invest , interval(year -6 0) by(company)


                However, Stata reports: <istmt>: 3499 entropyetc() not found
                r(3499);


                So I request your suggestion. If you could use the lutkepohl2 time series data rather than panel data, it would be easier to understand.

                Thank you





                • #9
                  Sorry, but there is some caprice in my answering at length, briefly, or at all on Statalist -- and here I can't answer at length given other commitments and I can only answer briefly.

                  For the record, I never suggested

                  Code:
                  rangestat (entropyetc) invest , interval(year -6 0) by(company)
                  and it's clear from studying the help for rangestat that that can't possibly work. You'll need to write some extra code, as I can't see that any one-liner will do what you want.




                  • #10
                    Finding some time to think about this I found it easier to work with rangerun (SSC).

                    Clearly you will need to change whatever nobs H to other variable names as needed or wished for your problem. The example uses windows of (at most) length 7 ending with the current observation, but again your choice is likely to be different. I show that results match those from entropyetc for the first and last complete windows in the toy dataset.

                    Code:
                    clear 
                    input whatever  
                    1 
                    2
                    3
                    4
                    5
                    6
                    7 
                    end 
                    expand whatever 
                    sort whatever 
                    gen t = 1989 + _n 
                    
                    list 
                    
                    capture program drop shannon_h 
                    program shannon_h 
                        tempname p 
                        tab whatever, matcell(`p') 
                        gen nobs = r(N) 
                        mata: p = st_matrix("`p'") 
                        mata: p = p :/ sum(p) 
                        mata: st_numscalar("H", -sum(p :* ln(p))) 
                        gen H = scalar(H) 
                    end 
                    
                    rangerun shannon_h, int(t -6 0) 
                    
                    qui entropyetc whatever in 1/7, gen(1=Hfirst) 
                    
                    qui entropyetc whatever in -7/L, gen(1=Hlast) 
                     
                    list , sep(7)
                    
                         +------------------------------------------------------+
                         | whatever      t   nobs          H     Hfirst   Hlast |
                         |------------------------------------------------------|
                      1. |        1   1990      1          0   1.277034       . |
                      2. |        2   1991      2   .6931472   1.277034       . |
                      3. |        2   1992      3   .6365142   1.277034       . |
                      4. |        3   1993      4   1.039721   1.277034       . |
                      5. |        3   1994      5    1.05492   1.277034       . |
                      6. |        3   1995      6   1.011404   1.277034       . |
                      7. |        4   1996      7   1.277034   1.277034       . |
                         |------------------------------------------------------|
                      8. |        4   1997      7   1.078992          .       . |
                      9. |        4   1998      7   1.004242          .       . |
                     10. |        4   1999      7   .6829081          .       . |
                     11. |        5   2000      7   .9556999          .       . |
                     12. |        5   2001      7   .9556999          .       . |
                     13. |        5   2002      7   .6829081          .       . |
                     14. |        5   2003      7   .6829081          .       . |
                         |------------------------------------------------------|
                     15. |        5   2004      7   .5982696          .       . |
                     16. |        6   2005      7   .7963116          .       . |
                     17. |        6   2006      7   .5982696          .       . |
                     18. |        6   2007      7   .6829081          .       . |
                     19. |        6   2008      7   .6829081          .       . |
                     20. |        6   2009      7   .5982696          .       . |
                     21. |        6   2010      7   .4101163          .       . |
                         |------------------------------------------------------|
                     22. |        7   2011      7   .4101163          .       0 |
                     23. |        7   2012      7   .5982696          .       0 |
                     24. |        7   2013      7   .6829081          .       0 |
                     25. |        7   2014      7   .6829081          .       0 |
                     26. |        7   2015      7   .5982696          .       0 |
                     27. |        7   2016      7   .4101163          .       0 |
                     28. |        7   2017      7          0          .       0 |
                         +------------------------------------------------------+



                    • #11
                      Thanks as ever to Kit Baum, entropyetc has now been updated on SSC. The latest version is version 3.

                      The previous version remains in the package as entropyetc2. The 2 arises because it was version 2.

                      Longish story short, the previous version ran into problems when users were working with thousands of categories. The program was using various commands and features that had various limits in various versions of Stata, namely Stata matrices, tabulate and tabdisp. There isn't a general fix or work-around short of re-writing the program with different syntax and different internal code. So, generate is now used for new variables and list for display.

                      A fresh look at the problem led me to drop the dissimilarity index and introduce calculation of the number of distinct categories.

                      To give some flavour, here are the results of running the examples in the help file. The graph is not shown here but is similar to that in #1.

                      Code:
                      . sysuse auto, clear 
                      (1978 automobile data)
                      
                      . entropyetc rep78, list
                      
                        +-----------------------------------------------------------+
                        |       distinct   Shannon H   exp(H)   Simpson   1/Simpson |
                        |-----------------------------------------------------------|
                        | all          5       1.358    3.888     0.297       3.369 |
                        +-----------------------------------------------------------+
                      
                      . entropyetc rep78, list by(foreign)
                      
                        +----------------------------------------------------------------+
                        |  foreign   distinct   Shannon H   exp(H)   Simpson   1/Simpson |
                        |----------------------------------------------------------------|
                        | Domestic          5       1.201    3.323     0.383       2.612 |
                        |  Foreign          3       1.004    2.730     0.388       2.579 |
                        +----------------------------------------------------------------+
                      
                      . 
                      . webuse nlsw88
                      (NLSW, 1988 extract)
                      
                      . entropyetc occupation, list by(industry) gen(3=numeq)
                      
                        +-------------------------------------------------------------------------------+
                        |                industry   distinct   Shannon H   exp(H)   Simpson   1/Simpson |
                        |-------------------------------------------------------------------------------|
                        |   Ag/Forestry/Fisheries          7       1.646    5.186     0.239       4.188 |
                        |                  Mining          2       0.562    1.755     0.625       1.600 |
                        |            Construction          7       1.399    4.050     0.353       2.832 |
                        |           Manufacturing          8       1.470    4.348     0.316       3.167 |
                        |  Transport/Comm/Utility          8       1.484    4.411     0.342       2.922 |
                        |-------------------------------------------------------------------------------|
                        |  Wholesale/Retail trade         10       1.740    5.698     0.214       4.681 |
                        | Finance/Ins/Real estate          5       1.206    3.340     0.355       2.818 |
                        |     Business/Repair svc          9       1.579    4.849     0.277       3.608 |
                        |       Personal services          8       1.597    4.937     0.243       4.107 |
                        |   Entertainment/Rec svc          7       1.712    5.538     0.218       4.587 |
                        |-------------------------------------------------------------------------------|
                        |   Professional services          7       1.590    4.902     0.219       4.558 |
                        |   Public administration          9       1.195    3.304     0.404       2.473 |
                        +-------------------------------------------------------------------------------+
                      (18 missing values generated)
                      
                      . egen tag = tag(industry)
                      
                      . graph dot (asis) numeq if tag, over(industry, sort(1) descending) ysc(alt) linetype(line) lines(lc(gs8) lw(vthin))
                      
                      . 
                      . webuse grunfeld, clear
                      
                      . entropyetc company [w=invest], list by(year)
                      (analytic weights assumed)
                      
                        +------------------------------------------------------------+
                        | year   distinct   Shannon H   exp(H)   Simpson   1/Simpson |
                        |------------------------------------------------------------|
                        | 1935         10       1.606    4.985     0.286       3.502 |
                        | 1936         10       1.584    4.875     0.283       3.535 |
                        | 1937         10       1.620    5.052     0.273       3.666 |
                        | 1938         10       1.730    5.640     0.242       4.134 |
                        | 1939         10       1.667    5.294     0.265       3.772 |
                        |------------------------------------------------------------|
                        | 1940         10       1.601    4.957     0.280       3.569 |
                        | 1941         10       1.652    5.217     0.263       3.798 |
                        | 1942         10       1.606    4.985     0.277       3.605 |
                        | 1943         10       1.597    4.938     0.285       3.507 |
                        | 1944         10       1.660    5.260     0.276       3.622 |
                        |------------------------------------------------------------|
                        | 1945         10       1.698    5.465     0.266       3.757 |
                        | 1946         10       1.660    5.259     0.267       3.742 |
                        | 1947         10       1.709    5.523     0.250       4.005 |
                        | 1948         10       1.732    5.654     0.240       4.160 |
                        | 1949         10       1.683    5.379     0.259       3.862 |
                        |------------------------------------------------------------|
                        | 1950         10       1.644    5.178     0.272       3.672 |
                        | 1951         10       1.712    5.540     0.248       4.034 |
                        | 1952         10       1.693    5.435     0.257       3.895 |
                        | 1953         10       1.614    5.025     0.292       3.424 |
                        | 1954         10       1.532    4.627     0.337       2.966 |
                        +------------------------------------------------------------+



                      • #12
                        Why did you choose to drop the dissimilarity index?
                        ---------------------------------
                        Maarten L. Buis
                        University of Konstanz
                        Department of history and sociology
                        box 40
                        78457 Konstanz
                        Germany
                        http://www.maartenbuis.nl
                        ---------------------------------



                        • #13
                          I've lost interest in it in this context. Also, it's awkward to calculate as you need to keep track of zeros in categories that might have occurred but didn't in some subset.

                          All the measures included are related to the family SUM p^a [ln(1/p)]^b for probabilities p and different a and b, which will become more prominent if I ever write up a longer paper on this topic.
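A minimal Mata sketch of that family, for illustration only. The function name pfamily() is hypothetical and is not code from entropyetc. It assumes all p are strictly positive, as they are when frequencies come from tabulate. With a = 0, b = 0 the sum counts distinct categories; a = 1, b = 1 gives Shannon H; a = 2, b = 0 gives Simpson.

```stata
* Sketch only: pfamily() is a hypothetical helper, not code from entropyetc.
capture mata: mata drop pfamily()
mata:
real scalar pfamily(real colvector p, real scalar a, real scalar b)
{
    // SUM p^a [ln(1/p)]^b over categories with p > 0
    return(sum((p :^ a) :* (ln(1 :/ p) :^ b)))
}
end

sysuse auto, clear
quietly tabulate rep78, matcell(freq)

mata:
p = st_matrix("freq")
p = p :/ sum(p)
st_numscalar("distinct", pfamily(p, 0, 0))   // a = 0, b = 0: number of categories
st_numscalar("H",        pfamily(p, 1, 1))   // a = 1, b = 1: Shannon H
st_numscalar("simpson",  pfamily(p, 2, 0))   // a = 2, b = 0: Simpson
end

display scalar(distinct) %8.3f scalar(H) %8.3f scalar(simpson)
```

The other reported columns, exp(H) and 1/Simpson, are transformations of these sums rather than further members of the family.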
                          Last edited by Nick Cox; 13 Jan 2024, 02:36.
