
  • Computing the Sum of Squared Errors between an Empirical Distribution Function and an estimated parametric distribution function

    I am trying to evaluate the goodness of fit between an empirical distribution function and parametric distributions whose parameters were estimated by maximum likelihood.

    My problem is this: I have a variable x in the data set. I fitted the variable by maximum likelihood to estimate the corresponding parameters of a lognormal distribution. Now I want to compute the SSE between the EDF and the estimated lognormal distribution, and I would welcome any suggestions on how to compute this value.

  • #2
    Two options:
    1) If you wish to test the equality of the two distributions, I would use the Kolmogorov-Smirnov test via the
    Code:
     ksmirnov x1=x2
    command, which is a test of equality of distribution functions. The test statistic is the supremum of the absolute differences between the two CDFs over the support of the random variable.
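    For the one-sample case raised in the original question (data against a fitted lognormal), ksmirnov also accepts an expression giving the hypothesized CDF. A minimal sketch, where x is the data variable and mu and sigma stand in for the ML estimates of the log-scale parameters (all names here are illustrative; note also that the usual KS p-values are not valid when those parameters were estimated from the same data):
    Code:
     scalar mu = 5       // assumed ML estimate, log-scale mean
     scalar sigma = 0.9  // assumed ML estimate, log-scale sd
     ksmirnov x = normal((ln(x) - scalar(mu))/scalar(sigma))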
    2) If you wish to do a simple squared sum of errors, consider that (assuming the support is defined in the same way for both distributions)

    Code:
    gen Diff = X1 - X2
    gen Diff2 = Diff^2
    egen SSE = total(Diff2)
    where each observation of X1 and X2 holds the value of the corresponding distribution function evaluated at the same point of the support. (egen's total() is the current name for the older sum() function.)
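    Concretely, the empirical CDF can be generated with cumul and compared with the fitted lognormal CDF at the same data points. A sketch along these lines, where x is the data variable and mu and sigma again stand in for the ML estimates (all names are illustrative, not from the original post):
    Code:
     scalar mu = 5
     scalar sigma = 0.9
     cumul x, gen(edf)                 // empirical CDF evaluated at each x
     gen double fitcdf = normal((ln(x) - scalar(mu))/scalar(sigma))  // fitted lognormal CDF
     gen double Diff2 = (edf - fitcdf)^2
     egen double SSE = total(Diff2)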



    • #3
      To me this idea is the wrong way round. You aren't, or shouldn't usually be, thinking of predicting the cumulative distribution function --- unless you have a good story in which it is scientifically or substantively the curve of most interest. If you are, then you will find that even lousy models produce small sums of squared errors, because CDF values are confined to [0, 1] and their squared differences are necessarily small. Thinking in terms of predicting the quantile function is typically more fruitful. Any quantile-quantile plot can be quantified in terms of prediction error.
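      The quantile-based comparison can be quantified along similar lines. A sketch for a fitted lognormal, where x is the data variable and mu and sigma stand in for the ML estimates of the log-scale parameters (plotting positions and all names are illustrative):
      Code:
       scalar mu = 5       // assumed ML estimates
       scalar sigma = 0.9
       sort x
       gen double p = (_n - 0.5)/_N                                    // plotting positions
       gen double qfit = exp(scalar(mu) + scalar(sigma)*invnormal(p))  // fitted quantiles
       gen double qDiff2 = (x - qfit)^2
       egen double qSSE = total(qDiff2)                                // error on the quantile scale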

      Also, the issue will not usually be just one of whether the lognormal is a good fit but of whether alternative models are better.



      • #4

        Sorry if I was not clear in my question. I have a variable called ipcf in my data set which represents an income distribution. As a first attempt I tried to estimate the SSE, but the values I got are too high. My code so far looks like this:
        Code:
        insheet using "sampletstgof.txt" ,clear
        * I have the variable ipcf which is the "true distribution" that i wanted to compare
        preserve
        contract ipcf
        qui summ _freq
        * Obtain the EDF as the running sum of relative frequencies
        gen edf = sum(_freq)/r(sum)
        save temp_1, replace
        restore
        cap drop _*
        merge m:1 ipcf using temp_1, nogen
        *Singh-Manddala Distribution
        smfit ipcf, cdf("sm_cdf") pdf("sm_pdf") stats
        *Lognormal Distribution
        lognfit ipcf, cdf("logn_cdf") pdf("logn_pdf") stats
        *Dagum Distribution
        dagumfit ipcf, cdf("dagum_cdf") pdf("dagum_pdf") stats
        * Generalized Beta (GB2) Distribution
        gb2fit ipcf, cdf("gb2_cdf") pdf("gb2_pdf") stats
        * I want to estimate the goodness of fit, so Q-Q plots
        qdagum ipcf
        qsm ipcf
        qgb2 ipcf
        qlogn ipcf, param(5.027258 .9192615 )
        * Sorry I have so many questions, but the literature around this issue isn't very explanatory. I have two questions:
        * 1) How can I calculate the SSE between the parametric distributions and the EDF of ipcf?
        * 2) How can I get a statistical test (Kolmogorov-Smirnov-like) to show which distribution fits better?




        • #5
          You're using several community-contributed commands there. That's fine, but you are asked to explain where they come from (FAQ Advice #12).

          My attitude is that it's often clear from the quantile plots that one distribution fits better than the others -- and often clear also that none of them work well.

          For more on that see e.g. https://stats.stackexchange.com/ques.../140625#140625

          For very skewed distributions, there is often a judgement call on how far it is important (or expected) to get a good fit in the far tail of the distribution.

          In terms of code, you can dig into e.g. qdagum.ado and find code for calculating fitted quantiles for the Dagum, and so forth.

          Alternatively, the fitting commands you mention all use maximum likelihood, so that gives you a definable criterion.
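          If those commands leave standard maximum-likelihood results behind in e() (as ml-based commands normally do), you can also compare the fits on information criteria; a sketch, assuming estimates store works after each fit:
          Code:
           lognfit ipcf
           estimates store logn
           dagumfit ipcf
           estimates store dagum
           estimates stats logn dagum   // log likelihood, AIC, BIC side by side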

          PS Maddala, not Manddala.



          • #6
            Thank you very much, Nick, for all your help.
