K-S test for Pareto distributed data.

Moritz Huth

Join Date: Mar 2023

Posts: 14
#1

14 Mar 2023, 14:24

Hello everyone,

For my master's thesis I am working with a survey dataset on wealth distribution where I am re-estimating the top tail of the distribution, i.e. the richest households. Since I assume that my wealth data follows a Pareto distribution, I need to find xmin, i.e. the right scale parameter, in order to identify the cut-off point. I have chosen the upper 10 percentiles of my wealth variable as possible xmins. For these 10 percentiles I calculated the respective CCDFs and estimated the 10 possible shape parameters "alpha".

I now want to perform a Kolmogorov-Smirnov test to identify the correct xmin using the p-values. If I understand the test correctly, it is primarily intended to test non-normal distributions. The assumed empirical (in my case Pareto) distribution is compared to a hypothetical distribution and the distance between both is calculated. If H0 cannot be rejected, the hypothetical fits the empirical model.

"Help ksmirnov" says:

Code:

ksmirnov varname = exp [if] [in]

where varname is the variable whose distribution is being tested, and exp must evaluate to the corresponding (theoretical) cumulative. In the example, they test for a normal distribution by putting in the function "normal(mean/sd)" for exp. Here's where I'm stuck, and I think I'm lacking some theoretical background. I assumed that I would use a Pareto function in the same way, and that I would do so for all 10 percentiles of my wealth distribution. However, Stata only offers "rpareto", for which nothing useful comes up. What do I have to put in for exp to test my data for a Pareto distribution? Do I have to calculate a hypothetical Pareto distribution to test my data against? If so, how would I do that?

Thank you and kind regards
Moritz
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

14 Mar 2023, 17:38

I can't follow what you're doing here. The minimum value for the simplest kind of Pareto distribution is in my experience just taken to be the minimum observed value and then the other parameter is (also) estimated by maximum likelihood. How the upper [?] percentiles appear is unclear. After that I wouldn't use a Kolmogorov-Smirnov test at all, as it tends to have the wrong kind of sensitivity and the usual procedure makes no allowance for parameters being estimated from the data. I'd use a quantile-quantile plot. That said, this distribution is of interest only for right-skewed distributions in which outliers are likely and good fits are often difficult or even controversial given many other candidate distributions with different brand names.

paretofit from SSC by Stephen Jenkins and Philippe Van Kerm is a tool of choice.
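
For readers who want to see the steps above in code, here is a minimal sketch on simulated data. It is not the paretofit implementation: the shape parameter comes from the closed-form maximum likelihood formula for a Pareto type I with the threshold fixed at the observed minimum, and the variable names, seed and parameter values (x, 100, 1.5) are made up for illustration. The lines ending in /// need to be run from a do-file.

```stata
* Minimal sketch on simulated data (all names and parameter values are hypothetical).
clear
set seed 12345
set obs 1000

* Draw from a Pareto type I by inverting the CDF: x = x0 * U^(-1/alpha)
generate double x = 100 * runiform()^(-1/1.5)

* Take the threshold to be the observed minimum
quietly summarize x, meanonly
scalar x0 = r(min)

* Closed-form ML estimate of the shape parameter: n / sum(ln(x/x0))
generate double lnratio = ln(x/scalar(x0))
quietly summarize lnratio, meanonly
scalar ahat = 1/r(mean)
display "x0 = " scalar(x0) "   alpha_hat = " scalar(ahat)

* Quantile-quantile plot: observed values against fitted Pareto quantiles
sort x
generate double p = (_n - 0.5)/_N
generate double qfit = scalar(x0) * (1 - p)^(-1/scalar(ahat))
twoway (scatter x qfit) (function y = x, range(qfit)), ///
    xscale(log) yscale(log) ///
    xtitle("Fitted Pareto quantiles") ytitle("Observed values") legend(off)
```

Points that stay close to the 45-degree line suggest a reasonable fit; systematic departures in the upper tail are what to look for. With survey weights the estimate and the plotting positions would both need to be weighted, which this sketch does not do.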
Moritz Huth

Join Date: Mar 2023

Posts: 14
#3

20 Mar 2023, 05:43

My explanation might be a bit sloppy, but basically it's what Stephen Jenkins did on p.12 in "Jenkins, Stephen P. (2016) Pareto models, top incomes, and recent trends in UK income inequality. Economica. ISSN 0013-0427" and what he also describes in this post from a couple of years ago https://www.statalist.org/forums/for...g-alpha-and-x0. There he states: "Fit your model using -paretofit- for multiple thresholds over a plausible range, and from this, calculate the K-S statistics." I already did the first part and obtained multiple estimated Pareto alphas for my possible thresholds. The second part is what troubles me as I'm not sure how to translate this step into Stata.

To compare the empirical with the theoretical cdf, my thinking is this: based on my wealth variable, I sort the data in ascending order and then use -cumul- to create the empirical cdfs for my 10 thresholds. I built the theoretical cdf for the same thresholds using the formula:

Code:

cdf = 1-(xmin/x)^alpha

However, the test results I obtained lead me to believe that I did something wrong. Here are the results for the first threshold; all other results are identical in p-values and differ only slightly in the test statistic D for ecdf_* and Combined K-S.

Code:

One-sample Kolmogorov–Smirnov test against theoretical distribution
           tcdf_1_1

Smaller group             D     p-value
---------------------------------------
ecdf_1_1             1.0122       0.000
Cumulative           0.0000       1.000
Combined K-S         1.0122       0.000

Here is the code I used for creating the ecdfs and tcdfs; I assume the error is somewhere in there (note that I keep using the 1/5 outer loop, as I have an imputed dataset):

Code:

// create ecdfs
forvalues i = 1/5 {
    gsort -rank_`i'
    forvalues j = 1/10 {
        cumul netwealth_`j'_`i', gen(ecdf_`j'_`i')
    }
}

// theoretical cdfs
local q = 1
forvalues s = 1/5 {
    gsort -rank_`q'
    gen running_sum_`s' = sum(weight)
    by percent_`q' (rank_`q'), sort: gen group_total_`s' = running_sum_`s'[_N]
    local q = `q' + 1
}

local j = 1
local k = 1
foreach var of varlist percent_* {
    gsort -rank_`j'
    forvalues i = 1/10 {
        gen ni_`i'_`k' = running_sum_`j' if `var' <= `i'
        egen nall_`i'_`k' = min(cond(`var' == `i', group_total_`j', .))
        forvalues q = 1/5 {
            local alpha_ij = alphas_`q'[`i',1]
        }
        gen tcdf_`i'_`k' = 1-(ni_`i'_`k'/nall_`i'_`k')^(`alpha_ij')
    }
    local j = `j' + 1
    local k = `k' + 1
}

// KS Test
forvalues i = 1/5 {
    forvalues j = 1/10 {
        ksmirnov ecdf_`j'_`i' = tcdf_`j'_`i'
    }
}

Any help is much appreciated.
Thank you
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#4

20 Mar 2023, 06:38

The "K-S test" in my paper is the test with that label described by Clauset et al. (2009) in the context of choosing a threshold (i.e., no originality from me). See their paper. You need to calculate statistics like "max |S_empirical - S_predicted|", where S refers to the survival function, for a range of thresholds and eyeball those (as Clauset et al. do). "Predicted" is derived from the fitted model given that threshold.

Put differently, I never used -ksmirnov-.
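
A rough sketch of that calculation, for anyone following along. It assumes simulated, unweighted data without ties, uses the closed-form maximum likelihood estimate of alpha rather than -paretofit-, and takes the 90th to 99th percentiles as candidate thresholds; the names x, x0_j, ahat_j and D_j are hypothetical. The lines ending in /// need to be run from a do-file.

```stata
* Minimal sketch on simulated, unweighted data (names and values are hypothetical).
clear
set seed 2023
set obs 5000
generate double x = 100 * runiform()^(-1/1.5)

* Candidate thresholds: the 90th to 99th percentiles of x
_pctile x, percentiles(90(1)99)
forvalues j = 1/10 {
    scalar x0_`j' = r(r`j')
}

forvalues j = 1/10 {
    preserve
    quietly keep if x >= scalar(x0_`j')

    * Fitted shape parameter for this threshold (closed-form ML)
    generate double lnratio = ln(x/scalar(x0_`j'))
    quietly summarize lnratio, meanonly
    scalar ahat_`j' = 1/r(mean)

    * Empirical survival function just before and just after each observation
    sort x
    generate double S_hi = 1 - (_n - 1)/_N
    generate double S_lo = 1 - _n/_N

    * Survival function predicted by the fitted model for this threshold
    generate double S_pred = (scalar(x0_`j')/x)^scalar(ahat_`j')

    * Largest absolute gap between empirical and predicted survival functions
    generate double gap = max(abs(S_hi - S_pred), abs(S_lo - S_pred))
    quietly summarize gap, meanonly
    scalar D_`j' = r(max)
    restore

    display "threshold " %2.0f `j' ":  x0 = " %9.2f scalar(x0_`j') ///
        "   alpha_hat = " %6.3f scalar(ahat_`j') "   D = " %6.4f scalar(D_`j')
}
```

The D values are then compared across thresholds; Clauset et al. pick the threshold with the smallest distance. With survey weights or multiply imputed data, the empirical survival function and the estimate of alpha would need to be adapted accordingly.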
Moritz Huth

Join Date: Mar 2023

Posts: 14
#5

22 Mar 2023, 09:37

You are right, I'm sorry, I should have put it more precisely that you didn't use the test and were only describing what Clauset et al. propose. I did read their paper and I get the overall idea.
In order not to cause any further confusion, I want to narrow my question down to this:

If I want to perform the KS Test, how do I calculate the theoretical CDF? If 1-(xmin/x)^alpha is the formula, what do I put in for xmin and x?
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#6

23 Mar 2023, 03:11

S_predicted is derived from the Pareto model you have fitted for some threshold x0. alpha_hat is the fitted Pareto parameter. [For each of the different threshold values, you'll get a different alpha_hat, and thence a different S_predicted.]
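
To make the roles of xmin and x concrete, here is a minimal sketch for a single threshold, assuming simulated unweighted data and a closed-form ML estimate of alpha in place of -paretofit-. All names and values (x, x0, ahat, the 90th percentile) are hypothetical. In the formula, xmin is the threshold x0 and x is each observed value at or above it.

```stata
* Minimal sketch on simulated data (names and values are hypothetical).
clear
set seed 314
set obs 2000
generate double x = 100 * runiform()^(-1/1.5)

* One candidate threshold, here the 90th percentile of x
_pctile x, percentiles(90)
scalar x0 = r(r1)

* alpha_hat fitted on the observations at or above that threshold
generate double lnratio = ln(x/scalar(x0)) if x >= scalar(x0)
quietly summarize lnratio, meanonly
scalar ahat = 1/r(mean)

* xmin is the threshold x0; x is each observed value at or above it
generate double S_pred = (scalar(x0)/x)^scalar(ahat) if x >= scalar(x0)
generate double F_pred = 1 - S_pred

* If -ksmirnov- is used at all, the raw variable goes on the left
* and the fitted CDF on the right
ksmirnov x = F_pred if x >= scalar(x0)
display "D = " r(D)
```

Note that the left-hand side is the wealth variable itself, not an empirical CDF created with -cumul-. As pointed out in #2, the p-value reported by -ksmirnov- makes no allowance for alpha having been estimated from the same data, so only the distance D is of interest when comparing thresholds.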