Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unable to reproduce Pearson's X^2 using mgof and Benford's distribution

    I am interested in comparing the distribution of some observed counts and its theoretical distribution. One popular law is Benford's and Ben Jann wrote a Stata paper along with the command
    mgof at http://www.stata-journal.com/sjpdf.h...iclenum=st0142

    I am not sure if it is appropriate to post here but I would like to check if my understanding of the paper is right.

    First, I was unable to get the digits.dta mentioned in the paper, following:


    Code:
        
     mata mata mlib index  
    use digits, clear
    Second, using the consequent example using
    Code:
        
     mgof firstdigit = log10(1+1/firstdigit), cr percent 
    The result is as follows (page 153)
    Click image for larger version

Name:	Screen Shot 2016-03-02 at 22.29.08.png
Views:	1
Size:	277.4 KB
ID:	1329044

    I was trying to manually calculate the Pearson's X2 reported by mgof using the formula in page 152 as below:
    Click image for larger version

Name:	Screen Shot 2016-03-02 at 22.31.52.png
Views:	1
Size:	252.2 KB
ID:	1329042

    But the manual calculation does not yield the same result as the reported one (2.000 compared with 6.2266). What I do is I take the observed frequency as f_j and the expected frequency as h_j, then use the provided formula.
    This formula works fine with normal distribution and all examples listed in help mgof but somehow it does not work the example provided in this paper.

    Have I done anything wrong?

    My motivation for the manual calculation is that I feel visual graph and tests provided mgof regarding Benford's distribution are not inline. Using digdis another command following mgof, I produced this table of results:
    Click image for larger version

Name:	Screen Shot 2016-03-02 at 22.42.18.png
Views:	1
Size:	117.7 KB
ID:	1329050

    Click image for larger version

Name:	Graph.png
Views:	1
Size:	87.2 KB
ID:	1329047


    I look at the observed and expected frequency and reckon that they are pretty close, in fact the MAD are small too. The produced graph also shows a resemblance between the observed and expected frequency. I expect the Pearson's X2 should not reject the null hypothesis that the data do follow Benford's distribution.

    Yet, the reported Pearson's X2 is huge and the respective p-value is insignificant.

    If I apply the formula as in the Stata article, I couldn't get the same reported Pearson's X2.

    My questions are:

    1. Can anyone please explain to me how the reported Pearson's X2 is calculated in mgof
    2. What should I conclude from my dataset? From the graph, it is clear that the observed frequency follows the theoretical distribution, yet the Pearson's X2 concludes in the opposite way.


    Many thanks in advance.
    Last edited by Canh Dang; 02 Mar 2016, 16:05.

  • #2
    A simple example may help us figure out what is going on. Set up some hypothetical data (perhaps street numbers)

    Code:
    input street_number firstdigit
    1001 1    
    1002 1
    1003 1
    2000 2
    2001 2
    2002 2
    2003 2
    2004 2
    2005 2
    3001 3
    3003 3
    3007 3
    4000 4
    4111 4
    4137 4
    4777 4
    4882 4
    4991 4
    end
    Here you have two variables: 1) the actual street number, and 2) its first digit


    Code:
    . list, sepby( firstdigit)
    
         +---------------------+
         | street~r   firstd~t |
         |---------------------|
      1. |     1001          1 |
      2. |     1002          1 |
      3. |     1003          1 |
         |---------------------|
      4. |     2000          2 |
      5. |     2001          2 |
      6. |     2002          2 |
      7. |     2003          2 |
      8. |     2004          2 |
      9. |     2005          2 |
         |---------------------|
     10. |     3001          3 |
     11. |     3003          3 |
     12. |     3007          3 |
         |---------------------|
     13. |     4000          4 |
     14. |     4111          4 |
     15. |     4137          4 |
     16. |     4777          4 |
     17. |     4882          4 |
     18. |     4991          4 |
         +---------------------+

    Now, we could replicate the bit of code in Jann (Stata Journal, 8, 2008, p. 153 )


    Code:
    *to install type net install mgof
    *to install type ssc install moremata
    mgof firstdigit = log10(1+1/firstdigit), cr percent
    Obtaining the following output:

    Code:
                      
    
    . mgof firstdigit = log10(1+1/firstdigit), cr percent
    
                           Number of obs =      18
                           N of outcomes =       4
                           Chi2 df       =       3
    
    ----------------------------------------------
          Goodness-of-fit |       Coef.    P-value
    ----------------------+-----------------------
             Pearson's X2 |    8.322074     0.0398
     Log likelihood ratio |     7.77045     0.0510
       Cressie-Read (2/3) |    8.020086     0.0456
    ----------------------------------------------
    
      firstdigit |  observed   expected
    -------------+----------------------
               1 |     16.67      43.07
               2 |     33.33      25.19
               3 |     16.67      17.87
               4 |     33.33      13.86
    -------------+----------------------
           Total |    100.00     100.00 

    Now, the usual calculation of Pearson's Chi2 statistic taking the absolute values as shown in the table above can be achieved by defining the following matrix (your calculation by hand)


    Code:
    matrix P = (16.67\33.33\16.67\33.33),(43.07\25.19\17.87\13.86)
    mgof, matrix(P)


    Code:
    . matrix P = (16.67\33.33\16.67\33.33),(43.07\25.19\17.87\13.86)
    
    .
    . mgof, matrix(P)
    
                           Number of obs =     100
                           N of outcomes =       4
                           Chi2 df       =       3
    
    ----------------------------------------------
          Goodness-of-fit |       Coef.    P-value
    ----------------------+-----------------------
             Pearson's X2 |    46.23909     0.0000
     Log likelihood ratio |    43.17198     0.0000
    ----------------------------------------------
    However, the MGOF calculation of Pearson's Chi2 will take into consideration the frequencies for each of the outcomes and number of observations. In the output above, you see that the number of obs.= 100, implying that the entries in the first column of matrix P sum up to 100. With our data, we have N=18, implying that if we define our matrix, the entries must sum up to 18. The second constraint that we have is that for each entry in the first column, it must also reflect the frequency of that outcome. So if each outcome was equally likely, then each entry in the first column would be 18/4 = 4.5. However, this is not the case for us. We need to weight entries in row 2 and row 4 twice as those in rows 1 and 3 (because 16.67= 33.33/2).

    Elementary calculus: Call the entry in row 1 and 3 "x". Then x+2x+x+2x=18 -> x= 3.

    Define matrix Q:

    Code:
    matrix Q = (3\6\3\6),(43.07\25.19\17.87\13.86)
    mgof, matrix(Q)
    This yields the calculation of the statistic (not exact because of rounding, but close enough!)


    Code:
    
    . matrix Q = (3\6\3\6),(43.07\25.19\17.87\13.86)
    
    .
    . mgof, matrix(Q)
    
                        Number of obs =      18
                           N of outcomes =       4
                           Chi2 df       =       3
    
    ----------------------------------------------
          Goodness-of-fit |       Coef.    P-value
    ----------------------+-----------------------
             Pearson's X2 |    8.325926     0.0397
     Log likelihood ratio |    7.773569     0.0509
    ----------------------------------------------

    ADDED ON

    In summary, from the table that MGOF outputs and my data above, I add an extra 2 columns of which you can apply your hand-calculation to get the statistic


    Code:
      firstdigit |  observed    expected   Counts     Expected
    -------------+----------------------------------------------
               1 |     16.67      43.07       3   |     7.7526      
               2 |     33.33      25.19       6   |     4.5342      
               3 |     16.67      17.87       3   |     3.2166      
               4 |     33.33      13.86       6   |     2.4948    
    -------------+--------------------------------+--------------
           Total |    100.00     100.00      18        18 
    
    
    Where Expected = (expected/100)*18
    Last edited by Andrew Musau; 03 Mar 2016, 11:59.

    Comment

    Working...
    X