
  • Problems with fitting GB distribution by MLE to LIS unit data

    Dear all,

    I'm trying to fit various GB-family distributions to income data from LIS unit-level datasets. The problem I keep facing is that some of the distributions do not seem to fit the income data.

    When fitting different distributions, an interesting situation keeps occurring in which the unrestricted model fails to find a solution while the restricted model (a special case of the higher-parameter distribution) succeeds. Could someone explain why this happens? Naturally I am also very keen to learn about ways to fix the problems I am having.


    Consistently, I keep having issues with:
    - the 4-parameter GB2 and GB1: a GB2 solution is found in about half of the datasets, and a GB1 solution in only a handful of the ~160 datasets;
    - the 3-parameter Beta1 and Beta2: similar or less success compared with GB1 and GB2.

    Interestingly, the 2-parameter distributions (lognormal, Fisk, gamma, Weibull) seem to fit in more or less 100% of the datasets considered. A solution is also found for all Dagum fits and, in most cases, for Singh-Maddala (SM) fits.


    I'm using the Stata commands fiskfit, lognfit, gammafit, weibullfit, dagumfit, smfit and gb2fit (thanks to Buis, Cox and Jenkins), as well as fitting commands for the other distributions that I wrote myself based on the code of gb2fit. I have also tried a number of different technique() options, with and without the difficult option. Some settings seem to favour some datasets, but the changes are mostly marginal. The same holds for rescaling incomes between units and thousands of units.

    The error messages I get (depending on distribution and dataset) are:
    - could not find feasible values
    - could not calculate numerical derivatives: missing values encountered
    - convergence not achieved


    My full do-file is about 1,500 lines long, but here are the most important parts:

    program gb1fit_ll
        version 8.1
        args lnf a b p q
        qui replace `lnf' = ln(`a') + (`a'*`p' - 1)*ln($S_MLy) + (`q' - 1)*ln(1 - ($S_MLy/`b')^(`a')) - (`a'*`p')*ln(`b') - lngamma(`p') - lngamma(`q') + lngamma(`p' + `q')
    end

    program beta1fit_ll
        version 8.1
        args lnf b p q
        qui replace `lnf' = (`p' - 1)*ln($S_MLy) + (`q' - 1)*ln(1 - $S_MLy/`b') - (`p')*ln(`b') - lngamma(`p') - lngamma(`q') + lngamma(`p' + `q')
    end

    program beta2fit_ll
        version 8.1
        args lnf b p q
        qui replace `lnf' = (`p' - 1)*ln($S_MLy) - (`p')*ln(`b') - lngamma(`p') - lngamma(`q') + lngamma(`p' + `q') - (`p' + `q')*ln(1 + $S_MLy/`b')
    end

    program gengammafit_ll
        version 8.1
        args lnf a b p
        qui replace `lnf' = ln(`a') + (`a'*`p' - 1)*ln($S_MLy) - (`a'*`p')*ln(`b') - lngamma(`p') - ($S_MLy/`b')^(`a')
    end

    local distrlist fisk logn gamma weibull dagum beta1 beta2 gengamma sm gb1 gb2

    foreach distr in `distrlist' {
        capture : `distr'fit $yvar if $yvar > 0 [aw=we], difficult iterate(1000)
        ...
    }
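    As a sanity check outside Stata, the same kind of GB2 log-likelihood maximisation can be sketched in Python/SciPy. This is purely illustrative, not the code I actually run; the parameter values, sample size, and starting values below are made up for the example:

```python
# Illustrative GB2 maximum-likelihood fit (not the Stata code above).
# GB2 density: f(x) = a * x^(a*p - 1) / ( b^(a*p) * B(p,q) * (1 + (x/b)^a)^(p+q) )
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def gb2_nll(theta, x):
    """Negative GB2 log-likelihood; theta holds log-parameters so the
    optimizer works on an unconstrained scale (all parameters > 0)."""
    a, b, p, q = np.exp(theta)
    ll = (np.log(a) + (a * p - 1) * np.log(x) - a * p * np.log(b)
          - betaln(p, q) - (p + q) * np.log1p((x / b) ** a))
    return -ll.sum()

rng = np.random.default_rng(0)
# Simulate GB2(a, b, p, q): if Z ~ Beta(p, q), then X = b * (Z / (1 - Z))^(1/a)
a0, b0, p0, q0 = 2.0, 1.0, 1.5, 2.0
z = rng.beta(p0, q0, size=5000)
x = b0 * (z / (1 - z)) ** (1 / a0)

start = np.log([a0, b0, p0, q0])   # in practice, seed from SM/Dagum fits
res = minimize(gb2_nll, start, args=(x,), method="Nelder-Mead",
               options={"maxiter": 5000})
a_hat, b_hat, p_hat, q_hat = np.exp(res.x)
```

    Working on the log scale sidesteps the "could not find feasible values" problem, since any candidate point the optimizer tries maps back to strictly positive parameters.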

    The results are written to a .csv file.

    One of the worst-fitting datasets (Finland 2004):
    dataset;Distribution;Mean;Log-L;SSE;SAE;Chi2;par1;par2;par3;par4;error_dummy
    fi04;dagum;34995.552;-107011.82;.0029263063287779;.1762591078877449;484.0472143648566;4.9403716;69296.152;.16656719;.;0
    fi04;beta1;.;.;0;0;0;.;.;.;.;1
    fi04;beta2;.;-107596.2;.0087654317035373;.3436714746057987;1483.029804663156;7.659e+09;1.1169577;237518.22;.;0
    fi04;gengamma;.;.;0;0;0;.;.;.;.;1
    fi04;sm;.;-107502.13;.0070364654916375;.2907900214195252;1354.312556440797;1.1499537;1.724e+09;229316.27;.;0
    fi04;gb1;.;.;0;0;0;.;.;.;.;1
    fi04;gb2;.;.;0;0;0;.;.;.;.;1


    And one of the best-fitting (Australia 1995):
    dataset;Distribution;Mean;Log-L;SSE;SAE;Chi2;a;b;c;d;error_dummy
    au95;fisk;53.698438;-21717.873;.0037656504070043;.2073850817978382;338.1778163919059;2.4879985;40.524231;.;.;0
    au95;logn;51.782965;-22019.962;.0088273451336249;.2844472713768482;547.3337241130193;3.6415797;.78164124;.;.;0
    au95;gamma;48.697075;-21684.452;.0026782182983851;.1625073105096817;191.0720707902185;22.124455;2.2010519;.;.;0
    au95;weibull;48.854015;-21777.489;.0038584757592162;.2187049686908722;295.8086521186779;54.102029;1.4965573;.;.;0
    au95;dagum;48.393966;-21548;.0012784655802562;.1206132508814335;116.1812030004812;3.74688;61.238609;.43080483;.;0
    au95;beta1;.;.;0;0;0;.;.;.;.;1
    au95;beta2;48.640651;-21665.653;.0028560391988321;.1681783348321915;199.9628612559609;418.71786;2.4345877;21.957888;.;0
    au95;gengamma;45.513946;-21683.456;.002779269434626;.165854673832655;195.9396153486823;.94191628;18.608602;2.4458552;.;0
    au95;sm;48.398556;-21586.624;.0016310668069544;.1278458423912525;131.1987107450299;1.9813232;78.95848;2.8145186;.;0
    au95;gb1;49.103449;-21752.068;.0498044795822352;.9949923679232597;22821302.75527501;.13384447;919.34325;34.801775;18.214408;0
    au95;gb2;48.485684;-21547.726;.0503780724247918;.9994861483573914;-5976257.133697152;3.9752689;60.125643;.40234381;.90979795;0




    Any advice is very welcome; thanks in advance.

    -Sami





    Last edited by Sami Remes; 28 Jul 2016, 05:03.

  • #2
    Don't expect that adding more parameters to your model will necessarily make it easier to fit. When you add parameters, you are effectively trying to squeeze more and more out of the same 'information' in the data, and that may not be possible. Success may also depend on precisely how the income variable is defined. [Remember too that no real-world distribution actually looks like the smoothed pictures produced by these models.]

    To try to get more of the 4-parameter models to fit, you might do things like:
    * try -gb2lfit- from SSC (it uses a different metric for the parameters in estimation than -gb2fit-)
    * work hard on improving your starting values; e.g. for GB2, you can use your estimates from the Singh-Maddala and Dagum distributions to assist you
    * use the maximize options to pump out more information at each iteration, and see whether there is useful information there (e.g. about which parameter is problematic and where it is getting stuck)
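    To illustrate the starting-value idea (a Python/SciPy sketch, not Stata code; scipy.stats.burr12 is the Burr XII / Singh-Maddala distribution, and all numbers here are invented for the example):

```python
# Sketch: seed a GB2 fit from a cheaper Singh-Maddala (Burr XII) fit.
# GB2 reduces to Singh-Maddala at p = 1 and to Dagum at q = 1, so the
# 3-parameter estimates give sensible starting values for the 4-parameter model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = stats.burr12.rvs(2.5, 1.8, scale=50.0, size=3000, random_state=rng)

# Fit Singh-Maddala; floc=0 pins the location at zero, as for income data
c_hat, d_hat, loc, scale_hat = stats.burr12.fit(x, floc=0)

# GB2(a, b, p, q) starting values implied by the SM fit
gb2_start = {"a": c_hat, "b": scale_hat, "p": 1.0, "q": d_hat}
```

    Because the restricted model lies inside the GB2 parameter space, the 4-parameter search then begins from a point that already fits the data reasonably well, rather than from arbitrary defaults.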

    You appear to be using a shotgun approach, trying "everything". I suggest that you be more focused. In any case, don't reinvent any wheels: my recollection is that Louis Chauvel (U Luxembourg) has already done a similar exercise to yours. See, e.g., his website and his recent Review of Income and Wealth paper.

    Finally, welcome to Statalist, but please, please read the forum FAQ before posting again. Note, e.g., the injunctions to tell us precisely where you got user-written programs from, and to post input and output within CODE delimiters so that they are legible to all readers.



    • #3
      Thank you, Prof. Jenkins, for your insight into the estimation issues. The -gb2lfit- command on SSC really does make my work considerably easier. I have also been working on your second suggestion, improving my starting values. It is a slow process to get the code to run as smoothly as possible over the numerous datasets in the LIS database, especially given my limited knowledge of Stata programming. Thanks also for pointing me to Prof. Chauvel's work; it definitely gave me some interesting ideas for further research.



      • #4
        Just a follow-up on an earlier suggestion. I have found it very helpful to start at the bottom and work up the distribution tree. Sometimes there is no improvement in moving up. For example, if the data are accurately modeled by a lognormal, attempting to move to the generalized gamma will result in the "q" parameter increasing hugely without an improved fit. Similarly, if the Dagum or Burr distribution provides a great fit, the GB2 or GB will not yield significant improvements and might be viewed as an example of overspecification.
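        The bottom-up check can be sketched like this (Python/SciPy used only for illustration; an information criterion such as AIC is one simple way to decide whether moving up the tree buys anything, and the data here are simulated):

```python
# Sketch: fit a 2-parameter model first and only move up the tree if the
# richer model clearly improves an information criterion such as AIC.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Data actually lognormal, so the generalized gamma should add ~nothing
x = stats.lognorm.rvs(0.8, scale=30.0, size=2000, random_state=rng)

def fit_aic(dist, x, k_free):
    """Fit by MLE with location fixed at 0; return AIC = 2k - 2*logL."""
    params = dist.fit(x, floc=0)
    loglik = dist.logpdf(x, *params).sum()
    return 2 * k_free - 2 * loglik

aic_ln = fit_aic(stats.lognorm, x, 2)    # lognormal: 2 free parameters
aic_gg = fit_aic(stats.gengamma, x, 3)   # generalized gamma: 3 free
# If aic_gg is not clearly below aic_ln, keep the lognormal.
```

        The same logic applies one level up: compare the Dagum or Singh-Maddala fit against GB2 before concluding that the fourth parameter is worth having.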

