Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Measures to compare distributions and the Probability-Probability-Plot

    Hi Statalisters,

    I want to compare the distributional fit of predicted values with the original data. The user written Stata command 'ppplot' by Nicholas J. Cox plots the percentiles of the predictions with the percentiles of the original data. A diagonal line is a perfect distributional fit. In 'Hinloopen, Jeroen; van Marrewijk, Charles (2005) : Comparing Distributions: The Harmonic Mass Index, Tinbergen Institute Discussion Paper, No. 05-122/1' the authors introduce the harmonic mass index which measures the area between the diagonal (e.g. the perfect fit) and the P-P-plot. Besides the graph I want to incorporate this measure in my study, but I am fairly new to Stata and haven't figured out a way how to calculate the HMI.

    How can I calculate this measure and are there others measures helping to compare the distributional fit of predictions and original data (besides the ppplot and ksmirnov)?

    Kind regards,

    Steffen

  • #2
    LIke this?
    Code:
    sysuse auto,clear
    ppplot connected mpg, by(fore) legend(off)  plot(function y = x, clp(dash)) /// 
        name(ppp_original,replace)
    
    ///recreate ppplot data
    separate mpg, by(fore)
    sort mpg
    foreach v of varlist mpg0 mpg1 {
        format `v' %2.1g 
        replace `v' = sum(1* (`v' < .)) 
        by mpg : replace `v' = `v'[_N] 
        replace `v' = `v' / `v'[_N]
    }   
    //check with ppp_original
    twoway connected mpg0 mpg1 || line mpg1 mpg1  , lp(dash) || ,legend(off) /// 
        xla(0(0.2)1) xtitle("") name(myppp, replace)
    qui integ mpg0 mpg1 , trapezoid
    //Subtract lower triangle
    disp "HMI = " r(integral) -.5

    Comment


    • #3
      Hi Scott,

      thank you so much. That helped a lot and is nearly what I intended. 2 follow-ups:

      1. I want to compare the distribution of a variable and the corresponding prediction and not the distribution of one variable but two subpopulations. I used your code and a workaround by predicting the outcome, then expanding the dataset and creating a group variable which divides the data into two groups (real data and prediction). Is there an easier way without expanding the dataset?
      2. The HMI measure is the area between the ppplot and the diagonal. Therefore I modified the calculation a bit.

      The modified code is below:
      Code:
      sysuse auto,clear
      regress price mpg headroom trunk weight length turn displacement gear_ratio foreign
      predict z2
      gen z1 = price
      ppplot connected z1 z2, legend(off)  plot(function y = x, clp(dash)) name(ppp_original,replace)
      
      expand 2
      gen group = _n
      replace group = 1 if _n>_N/2
      replace group = 2 if _n<=_N/2
      gen z3 = .
      replace z3 = z1 if group==1
      replace z3 = z2 if group==2
      ppplot connected z3, by(group) legend(off)  plot(function y = x, clp(dash)) name(ppp_original_test,replace)
      
      separate z3, by(group)
      sort z3
      foreach v of varlist z31 z32 {
          format `v' %2.1g
          replace `v' = sum(1* (`v' < .))
          by z3 : replace `v' = `v'[_N]
          replace `v' = `v' / `v'[_N]
      }  
      
      twoway connected z31 z32 || line z32 z32  , lp(dash) || ,legend(off) xla(0(0.2)1) xtitle("") name(myppp, replace)
      gen z33 = abs(z31-z32)
      qui integ z33 z32, trapezoid
      disp "HMI = "r(integral)*2
      gen z34 = (z31-z32)^2
      qui integ z34 z32, trapezoid
      disp "HWMI = "r(integral)*2
      Again, thank you.

      Kind regards,

      Steffen

      Comment

      Working...
      X