Unable to reproduce Pearson's X^2 using mgof and Benford's distribution

Canh Dang

Join Date: Apr 2015

Posts: 29
#1

Unable to reproduce Pearson's X^2 using mgof and Benford's distribution

02 Mar 2016, 15:49

I am interested in comparing the distribution of some observed counts and its theoretical distribution. One popular law is Benford's and Ben Jann wrote a Stata paper along with the command
mgof at http://www.stata-journal.com/sjpdf.h...iclenum=st0142

I am not sure if it is appropriate to post here but I would like to check if my understanding of the paper is right.

First, I was unable to get the digits.dta mentioned in the paper, following:

Code:

mata mata mlib index use digits, clear

Second, using the consequent example using

Code:

mgof firstdigit = log10(1+1/firstdigit), cr percent

The result is as follows (page 153)

I was trying to manually calculate the Pearson's X2 reported by mgof using the formula in page 152 as below:

But the manual calculation does not yield the same result as the reported one (2.000 compared with 6.2266). What I do is I take the observed frequency as f_j and the expected frequency as h_j, then use the provided formula.
This formula works fine with normal distribution and all examples listed in help mgof but somehow it does not work the example provided in this paper.

Have I done anything wrong?

My motivation for the manual calculation is that I feel visual graph and tests provided mgof regarding Benford's distribution are not inline. Using digdis another command following mgof, I produced this table of results:

I look at the observed and expected frequency and reckon that they are pretty close, in fact the MAD are small too. The produced graph also shows a resemblance between the observed and expected frequency. I expect the Pearson's X2 should not reject the null hypothesis that the data do follow Benford's distribution.

Yet, the reported Pearson's X2 is huge and the respective p-value is insignificant.

If I apply the formula as in the Stata article, I couldn't get the same reported Pearson's X2.

My questions are:

1. Can anyone please explain to me how the reported Pearson's X2 is calculated in mgof
2. What should I conclude from my dataset? From the graph, it is clear that the observed frequency follows the theoretical distribution, yet the Pearson's X2 concludes in the opposite way.

Many thanks in advance.

Last edited by Canh Dang; 02 Mar 2016, 16:05.
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10225

03 Mar 2016, 11:03

A simple example may help us figure out what is going on. Set up some hypothetical data (perhaps street numbers)

Code:

input street_number firstdigit
1001 1    
1002 1
1003 1
2000 2
2001 2
2002 2
2003 2
2004 2
2005 2
3001 3
3003 3
3007 3
4000 4
4111 4
4137 4
4777 4
4882 4
4991 4
end

Here you have two variables: 1) the actual street number, and 2) its first digit

Code:

. list, sepby( firstdigit)

     +---------------------+
     | street~r   firstd~t |
     |---------------------|
  1. |     1001          1 |
  2. |     1002          1 |
  3. |     1003          1 |
     |---------------------|
  4. |     2000          2 |
  5. |     2001          2 |
  6. |     2002          2 |
  7. |     2003          2 |
  8. |     2004          2 |
  9. |     2005          2 |
     |---------------------|
 10. |     3001          3 |
 11. |     3003          3 |
 12. |     3007          3 |
     |---------------------|
 13. |     4000          4 |
 14. |     4111          4 |
 15. |     4137          4 |
 16. |     4777          4 |
 17. |     4882          4 |
 18. |     4991          4 |
     +---------------------+

Now, we could replicate the bit of code in Jann (Stata Journal, 8, 2008, p. 153 )

Code:

*to install type net install mgof
*to install type ssc install moremata
mgof firstdigit = log10(1+1/firstdigit), cr percent

Obtaining the following output:

Code:

                  

. mgof firstdigit = log10(1+1/firstdigit), cr percent

                       Number of obs =      18
                       N of outcomes =       4
                       Chi2 df       =       3

----------------------------------------------
      Goodness-of-fit |       Coef.    P-value
----------------------+-----------------------
         Pearson's X2 |    8.322074     0.0398
 Log likelihood ratio |     7.77045     0.0510
   Cressie-Read (2/3) |    8.020086     0.0456
----------------------------------------------

  firstdigit |  observed   expected
-------------+----------------------
           1 |     16.67      43.07
           2 |     33.33      25.19
           3 |     16.67      17.87
           4 |     33.33      13.86
-------------+----------------------
       Total |    100.00     100.00

Now, the usual calculation of Pearson's Chi2 statistic taking the absolute values as shown in the table above can be achieved by defining the following matrix (your calculation by hand)

Code:

matrix P = (16.67\33.33\16.67\33.33),(43.07\25.19\17.87\13.86)
mgof, matrix(P)

Code:

. matrix P = (16.67\33.33\16.67\33.33),(43.07\25.19\17.87\13.86)

.
. mgof, matrix(P)

                       Number of obs =     100
                       N of outcomes =       4
                       Chi2 df       =       3

----------------------------------------------
      Goodness-of-fit |       Coef.    P-value
----------------------+-----------------------
         Pearson's X2 |    46.23909     0.0000
 Log likelihood ratio |    43.17198     0.0000
----------------------------------------------

However, the MGOF calculation of Pearson's Chi2 will take into consideration the frequencies for each of the outcomes and number of observations. In the output above, you see that the number of obs.= 100, implying that the entries in the first column of matrix P sum up to 100. With our data, we have N=18, implying that if we define our matrix, the entries must sum up to 18. The second constraint that we have is that for each entry in the first column, it must also reflect the frequency of that outcome. So if each outcome was equally likely, then each entry in the first column would be 18/4 = 4.5. However, this is not the case for us. We need to weight entries in row 2 and row 4 twice as those in rows 1 and 3 (because 16.67= 33.33/2).

Elementary calculus: Call the entry in row 1 and 3 "x". Then x+2x+x+2x=18 -> x= 3.

Define matrix Q:

Code:

matrix Q = (3\6\3\6),(43.07\25.19\17.87\13.86)
mgof, matrix(Q)

This yields the calculation of the statistic (not exact because of rounding, but close enough!)

Code:


. matrix Q = (3\6\3\6),(43.07\25.19\17.87\13.86)

.
. mgof, matrix(Q)

                    Number of obs =      18
                       N of outcomes =       4
                       Chi2 df       =       3

----------------------------------------------
      Goodness-of-fit |       Coef.    P-value
----------------------+-----------------------
         Pearson's X2 |    8.325926     0.0397
 Log likelihood ratio |    7.773569     0.0509
----------------------------------------------

ADDED ON

In summary, from the table that MGOF outputs and my data above, I add an extra 2 columns of which you can apply your hand-calculation to get the statistic

Code:

  firstdigit |  observed    expected   Counts     Expected
-------------+----------------------------------------------
           1 |     16.67      43.07       3   |     7.7526      
           2 |     33.33      25.19       6   |     4.5342      
           3 |     16.67      17.87       3   |     3.2166      
           4 |     33.33      13.86       6   |     2.4948    
-------------+--------------------------------+--------------
       Total |    100.00     100.00      18        18 


Where Expected = (expected/100)*18

Last edited by Andrew Musau; 03 Mar 2016, 11:59.

Announcement

Unable to reproduce Pearson's X^2 using mgof and Benford's distribution

Comment