
  • Computing the Sum of Squared Errors between an Empirical Distribution Function and an estimated parametric distribution function

    I am trying to evaluate the goodness of fit between an empirical distribution function and parametric distributions whose parameters were estimated by maximum likelihood.

    My problem is this: I have a variable x in the data set. I fitted the variable by maximum likelihood to estimate the corresponding parameters of a lognormal distribution. Now I want to compute the SSE between the EDF and the estimated lognormal distribution, and I would welcome any suggestions on how to compute this value.

  • #2
    Two options:
    1) If you wish to test the equality of the two distributions, I would use the Kolmogorov-Smirnov test via the
    Code:
     ksmirnov x1=x2
    command, which is a test of equality of distribution functions. The test statistic is the supremum of the absolute differences between the two CDFs over the support of the random variable.
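    For the one-sample case raised in the original question (data against a fitted lognormal), ksmirnov also accepts an expression giving the hypothesized CDF. A minimal sketch, where x is the data variable and mu and sigma stand in for the ML estimates of the log-scale parameters (all names here are illustrative; note also that the usual KS p-values are not valid when those parameters were estimated from the same data):
    Code:
     scalar mu = 5       // assumed ML estimate, log-scale mean
     scalar sigma = 0.9  // assumed ML estimate, log-scale sd
     ksmirnov x = normal((ln(x) - scalar(mu))/scalar(sigma))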
    2) If you wish to do a simple squared sum of errors, consider that (assuming the support is defined in the same way for both distributions)

    Code:
    gen Diff = X1 - X2
    gen Diff2 = Diff^2
    egen SSE = total(Diff2)
    where each observation of X1 and X2 holds the value of the corresponding distribution function evaluated at the same point of the support. (egen's total() is the current name for the older sum() function.)
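    Concretely, the empirical CDF can be generated with cumul and compared with the fitted lognormal CDF at the same data points. A sketch along these lines, where x is the data variable and mu and sigma again stand in for the ML estimates (all names are illustrative, not from the original post):
    Code:
     scalar mu = 5
     scalar sigma = 0.9
     cumul x, gen(edf)                 // empirical CDF evaluated at each x
     gen double fitcdf = normal((ln(x) - scalar(mu))/scalar(sigma))  // fitted lognormal CDF
     gen double Diff2 = (edf - fitcdf)^2
     egen double SSE = total(Diff2)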



    • #3
      To me this idea is the wrong way round. You aren't, or shouldn't usually be, thinking of predicting the cumulative distribution function --- unless you have a good story in which it is scientifically or substantively the curve of most interest. If you are, then you will find that even lousy models produce small sums of squared errors, because CDF values are confined to [0, 1] and their squared differences are necessarily small. Thinking in terms of predicting the quantile function is typically more fruitful. Any quantile-quantile plot can be quantified in terms of prediction error.
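      The quantile-based comparison can be quantified along similar lines. A sketch for a fitted lognormal, where x is the data variable and mu and sigma stand in for the ML estimates of the log-scale parameters (plotting positions and all names are illustrative):
      Code:
       scalar mu = 5       // assumed ML estimates
       scalar sigma = 0.9
       sort x
       gen double p = (_n - 0.5)/_N                                    // plotting positions
       gen double qfit = exp(scalar(mu) + scalar(sigma)*invnormal(p))  // fitted quantiles
       gen double qDiff2 = (x - qfit)^2
       egen double qSSE = total(qDiff2)                                // error on the quantile scale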

      Also, the issue will not usually be just one of whether the lognormal is a good fit but of whether alternative models are better.



      • #4

        Sorry if I was not clear in my question. I have a variable called ipcf in my data set which represents an income distribution. As a first attempt I tried to estimate the SSE, but the values I got are too high. My code so far looks like this:
        Code:
        insheet using "sampletstgof.txt" ,clear
        * I have the variable ipcf which is the "true distribution" that i wanted to compare
        preserve
        contract ipcf
        qui summ _freq
        * Obtain the EDF as the running sum of relative frequencies
        gen edf = sum(_freq)/r(sum)
        save temp_1, replace
        restore
        cap drop _*
        merge m:1 ipcf using temp_1, nogen
        *Singh-Manddala Distribution
        smfit ipcf, cdf("sm_cdf") pdf("sm_pdf") stats
        *Lognormal Distribution
        lognfit ipcf, cdf("logn_cdf") pdf("logn_pdf") stats
        *Dagum Distribution
        dagumfit ipcf, cdf("dagum_cdf") pdf("dagum_pdf") stats
        * Generalized Beta (GB2) Distribution
        gb2fit ipcf, cdf("gb2_cdf") pdf("gb2_pdf") stats
        * I want to estimate the goodness of fit, so Q-Q plots
        qdagum ipcf
        qsm ipcf
        qgb2 ipcf
        qlogn ipcf, param(5.027258 .9192615 )
        * Sorry I have so many questions, but the literature around this issue isn't very explanatory. I have two questions:
        * 1) How can I calculate the SSE between the parametric distributions and the EDF of ipcf?
        * 2) How can I get a statistical test (Kolmogorov-Smirnov-like) to show which distribution fits better?




        • #5
          You're using several community-contributed commands there. That's fine, but you are asked to explain where they come from (FAQ Advice #12).

          My attitude is that it's often clear from the quantile plots that one distribution fits better than the others -- and often clear also that none of them work well.

          For more on that see e.g. https://stats.stackexchange.com/ques.../140625#140625

          For very skewed distributions, there is often a judgement call on how far it is important (or expected) to get a good fit in the far tail of the distribution.

          In terms of code, you can dig into e.g. qdagum.ado and find code for calculating fitted quantiles for the Dagum, and so forth.

          Alternatively, the fitting commands you mention all use maximum likelihood, so that gives you a definable criterion.
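          If those commands leave standard maximum-likelihood results behind in e() (as ml-based commands normally do), you can also compare the fits on information criteria; a sketch, assuming estimates store works after each fit:
          Code:
           lognfit ipcf
           estimates store logn
           dagumfit ipcf
           estimates store dagum
           estimates stats logn dagum   // log likelihood, AIC, BIC side by side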

          PS Maddala, not Manddala.



          • #6
            Thank you very much, Nick, for all your help.
