  • K-S test for Pareto distributed data.

    Hello everyone,

    for my master's thesis I am working with a survey dataset on wealth distribution where I am re-estimating the top tail of the distribution, i.e. the richest households. Since I assume that my wealth data follows a Pareto distribution, I need to find xmin, i.e. the right scale parameter, in order to identify the cut-off point. I have chosen the upper 10 percentiles of my wealth variable as possible values of xmin. For each of these 10 percentiles I calculated the respective CCDF and estimated the corresponding shape parameter "alpha".

    I now want to perform a Kolmogorov-Smirnov test to identify the correct xmin using the p-values. If I understand the test correctly, it is primarily intended to test non-normal distributions. The assumed empirical (in my case Pareto) distribution is compared to a hypothetical distribution and the distance between both is calculated. If H0 cannot be rejected, the hypothetical fits the empirical model.

    "Help ksmirnov" says:
    Code:
    ksmirnov varname = exp [if] [in]
    where varname is the variable whose distribution is being tested, and exp must evaluate to the corresponding (theoretical) cumulative. In the example, they test for a normal distribution by putting in an expression of the form "normal((x - mean)/sd)" for exp. Here's where I'm stuck, and I think I'm lacking some theoretical background. I assumed that I would use a Pareto function in the same way, and that I would do so for all 10 percentiles of my wealth distribution. However, Stata only offers "rpareto", and nothing useful comes up for it. What do I have to put in for exp to test my data for a Pareto distribution? Do I have to calculate a hypothetical Pareto distribution to test my data against? If so, how would I do that?

    Thank you and kind regards
    Moritz
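
    A minimal sketch of what exp could look like for a Pareto tail, assuming the type I form F(x) = 1 - (x0/x)^alpha for x >= x0. The data are simulated, and the names wealth, x0 and alpha (with their values) are hypothetical placeholders rather than anything from the dataset described above. Note that -ksmirnov- as used here is unweighted and its p-value treats alpha as known rather than estimated.

    ```stata
    * Sketch only: simulated data; wealth, x0 and alpha are hypothetical.
    clear
    set seed 12345
    set obs 1000
    scalar x0    = 100000     // assumed threshold (xmin)
    scalar alpha = 1.5        // assumed Pareto shape parameter

    * Inverse-CDF draw from a Pareto(x0, alpha) distribution
    generate double wealth = scalar(x0) * runiform()^(-1/scalar(alpha))

    * exp is the theoretical Pareto CDF evaluated at each observation,
    * restricted to observations at or above the threshold
    ksmirnov wealth = 1 - (scalar(x0)/wealth)^scalar(alpha) if wealth >= scalar(x0)
    ```

    The left-hand side is the raw variable, not an empirical CDF; -ksmirnov- builds the empirical distribution itself.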

  • #2
    I can't follow what you're doing here. The minimum value for the simplest kind of Pareto distribution is in my experience just taken to be the minimum observed value and then the other parameter is (also) estimated by maximum likelihood. How the upper [?] percentiles appear is unclear. After that I wouldn't use a Kolmogorov-Smirnov test at all, as it tends to have the wrong kind of sensitivity and the usual procedure makes no allowance for parameters being estimated from the data. I'd use a quantile-quantile plot. That said, this distribution is of interest only for right-skewed distributions in which outliers are likely, and good fits are often difficult or even controversial given the many other candidate distributions with different brand names.

    paretofit from SSC by Stephen Jenkins and Philippe Van Kerm is a tool of choice.

    • #3
      My explanation might be a bit sloppy, but basically it's what Stephen Jenkins did on p.12 in "Jenkins, Stephen P. (2016) Pareto models, top incomes, and recent trends in UK income inequality. Economica. ISSN 0013-0427" and what he also describes in this post from a couple of years ago https://www.statalist.org/forums/for...g-alpha-and-x0. There he states: "Fit your model using -paretofit- for multiple thresholds over a plausible range, and from this, calculate the K-S statistics." I already did the first part and obtained multiple estimated Pareto alphas for my possible thresholds. The second part is what troubles me as I'm not sure how to translate this step into Stata.

      To compare the empirical with the theoretical cdf, my thinking is this: based on my wealth variable, I sort the data in ascending order and then use -cumul- to create the empirical cdfs for my 10 thresholds. The theoretical cdf I built by using the following formula for the same thresholds:
      Code:
      cdf = 1-(xmin/x)^alpha
      However, the test results I obtained lead me to believe that I did something wrong. Here are the results for the first threshold; all other results have identical p-values and differ only slightly in the test statistic D for ecdf_* and for Combined K-S.
      Code:
      One-sample Kolmogorov–Smirnov test against theoretical distribution tcdf_1_1
      
      Smaller group             D     p-value  
      ---------------------------------------
      ecdf_1_1             1.0122       0.000
      Cumulative           0.0000       1.000
      Combined K-S         1.0122       0.000
      Here is the code I used for creating the ecdfs and tcdfs. I assume the error is somewhere in there (note that I keep using the 1/5 outer loop, as I have an imputed dataset):
      Code:
      // create ecdfs
      forvalues i = 1/5 {
          gsort -rank_`i'
          forvalues j = 1/10 {
              cumul netwealth_`j'_`i', gen(ecdf_`j'_`i')
          }
      }
      
      // theoretical cdfs
      local q = 1
      forvalues s = 1/5 {
          gsort -rank_`q'
          gen running_sum_`s' = sum(weight)
          by percent_`q' (rank_`q'), sort: gen group_total_`s' = running_sum_`s'[_N]
          local q = `q' + 1
      }
      
      local j = 1
      local k = 1
      foreach var of varlist percent_* {
          gsort -rank_`j'
          forvalues i = 1/10 {
              gen ni_`i'_`k' = running_sum_`j' if `var' <= `i'
              egen nall_`i'_`k' = min(cond(`var' == `i', group_total_`j', .))
              forvalues q = 1/5 {
                  local alpha_ij = alphas_`q'[`i',1]
              }
              gen tcdf_`i'_`k' = 1-(ni_`i'_`k'/nall_`i'_`k')^(`alpha_ij')    
          }
          local j = `j' + 1
          local k = `k' + 1
      }
      
      // KS Test 
      forvalues i = 1/5 {
          forvalues j = 1/10 {
              ksmirnov ecdf_`j'_`i' = tcdf_`j'_`i'
          }
      }
      Any help is much appreciated.
      Thank you

      • #4
        The "K-S test" in my paper is the test with that label described by Clauset et al. (2009) in the context of choosing a threshold (i.e., no originality from me). See their paper. You need to calculate statistics like "max |S_empirical - S_predicted|", where S refers to the survival function, for a range of thresholds and eyeball those (as Clauset et al. do). "predicted" is derived from the fitted model given that threshold.

        Put differently, I never used -ksmirnov-
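
        The calculation described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the paper: the data are simulated, the variable and matrix names are hypothetical, weights and multiple imputation are ignored, and alpha is obtained from the closed-form unweighted ML estimator rather than from -paretofit-. The distance is evaluated at the observed data points only.

        ```stata
        * Sketch only: simulated, unweighted data; all names are hypothetical.
        clear
        set seed 2024
        set obs 5000
        generate double wealth = 50000 * runiform()^(-1/1.8)

        matrix KS = J(10, 3, .)
        matrix colnames KS = x0 alpha_hat D
        local r = 0
        forvalues p = 90/99 {
            local ++r
            * Candidate threshold: the p-th percentile of wealth
            quietly _pctile wealth, p(`p')
            scalar x0 = r(r1)
            preserve
            quietly keep if wealth >= scalar(x0)
            * Closed-form ML estimate of alpha above x0 (unweighted)
            quietly generate double lnratio = ln(wealth/scalar(x0))
            quietly summarize lnratio, meanonly
            scalar alpha_hat = 1/r(mean)
            * Empirical CDF among observations above x0
            quietly cumul wealth, generate(ecdf) equal
            * |S_empirical - S_predicted| at each observed value
            quietly generate double dist = abs((1 - ecdf) - (scalar(x0)/wealth)^scalar(alpha_hat))
            quietly summarize dist, meanonly
            matrix KS[`r', 1] = scalar(x0)
            matrix KS[`r', 2] = scalar(alpha_hat)
            matrix KS[`r', 3] = r(max)
            restore
        }
        * One row per candidate threshold; compare the D column across rows
        matrix list KS
        ```

        Each row of KS holds one threshold, its fitted alpha, and the maximum distance, so the thresholds can be compared by eye.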

        • #5
          You are right, I'm sorry, I should have put it more precisely that you didn't use the test and were only describing what Clauset et al. propose. I did read their paper and I get the overall idea.
          In order not to cause any further confusion, I want to narrow my question down to this:

          If I want to perform the KS Test, how do I calculate the theoretical CDF? If 1-(xmin/x)^alpha is the formula, what do I put in for xmin and x?

          • #6
            S_predicted is derived from the Pareto model you have fitted for some threshold x0. alpha_hat is the fitted Pareto parameter. [For each of the different threshold values, you'll get a different alpha_hat, and thence a different S_predicted.]
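
            Written out, and assuming the Pareto type I form quoted earlier in the thread (cdf = 1-(xmin/x)^alpha), the quantities would be as below. Here x_0 is the candidate threshold (the xmin of the formula), x_i runs over the observed values at or above x_0, and the symbol D(x_0) is introduced only for illustration.

            ```latex
            \begin{aligned}
            S_{\text{predicted}}(x) &= \left(\frac{x_0}{x}\right)^{\hat\alpha}, \qquad x \ge x_0, \\
            F_{\text{predicted}}(x) &= 1 - S_{\text{predicted}}(x), \\
            D(x_0) &= \max_{i:\, x_i \ge x_0} \left| S_{\text{empirical}}(x_i) - S_{\text{predicted}}(x_i) \right| .
            \end{aligned}
            ```

            So xmin is fixed at the threshold under consideration, while x varies across the observations above that threshold.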
