Measures to compare distributions and the Probability-Probability-Plot

Steffen Pluetzke

Join Date: Mar 2019

Posts: 20
#1

Measures to compare distributions and the Probability-Probability-Plot

24 May 2019, 01:53

Hi Statalisters,

I want to compare the distributional fit of predicted values with the original data. The user written Stata command 'ppplot' by Nicholas J. Cox plots the percentiles of the predictions with the percentiles of the original data. A diagonal line is a perfect distributional fit. In 'Hinloopen, Jeroen; van Marrewijk, Charles (2005) : Comparing Distributions: The Harmonic Mass Index, Tinbergen Institute Discussion Paper, No. 05-122/1' the authors introduce the harmonic mass index which measures the area between the diagonal (e.g. the perfect fit) and the P-P-plot. Besides the graph I want to incorporate this measure in my study, but I am fairly new to Stata and haven't figured out a way how to calculate the HMI.

How can I calculate this measure and are there others measures helping to compare the distributional fit of predictions and original data (besides the ppplot and ksmirnov)?

Kind regards,

Steffen
Tags: None

Scott Merryman

Join Date: Mar 2014
Posts: 895

24 May 2019, 13:01

LIke this?

Code:

sysuse auto,clear
ppplot connected mpg, by(fore) legend(off)  plot(function y = x, clp(dash)) /// 
    name(ppp_original,replace)

///recreate ppplot data
separate mpg, by(fore)
sort mpg
foreach v of varlist mpg0 mpg1 {
    format `v' %2.1g 
    replace `v' = sum(1* (`v' < .)) 
    by mpg : replace `v' = `v'[_N] 
    replace `v' = `v' / `v'[_N]
}   
//check with ppp_original
twoway connected mpg0 mpg1 || line mpg1 mpg1  , lp(dash) || ,legend(off) /// 
    xla(0(0.2)1) xtitle("") name(myppp, replace)
qui integ mpg0 mpg1 , trapezoid
//Subtract lower triangle
disp "HMI = " r(integral) -.5

Comment

Steffen Pluetzke

Join Date: Mar 2019
Posts: 20

25 May 2019, 09:03

Hi Scott,

thank you so much. That helped a lot and is nearly what I intended. 2 follow-ups:

1. I want to compare the distribution of a variable and the corresponding prediction and not the distribution of one variable but two subpopulations. I used your code and a workaround by predicting the outcome, then expanding the dataset and creating a group variable which divides the data into two groups (real data and prediction). Is there an easier way without expanding the dataset?
2. The HMI measure is the area between the ppplot and the diagonal. Therefore I modified the calculation a bit.

The modified code is below:

Code:

sysuse auto,clear
regress price mpg headroom trunk weight length turn displacement gear_ratio foreign
predict z2
gen z1 = price
ppplot connected z1 z2, legend(off)  plot(function y = x, clp(dash)) name(ppp_original,replace)

expand 2
gen group = _n
replace group = 1 if _n>_N/2
replace group = 2 if _n<=_N/2
gen z3 = .
replace z3 = z1 if group==1
replace z3 = z2 if group==2
ppplot connected z3, by(group) legend(off)  plot(function y = x, clp(dash)) name(ppp_original_test,replace)

separate z3, by(group)
sort z3
foreach v of varlist z31 z32 {
    format `v' %2.1g
    replace `v' = sum(1* (`v' < .))
    by z3 : replace `v' = `v'[_N]
    replace `v' = `v' / `v'[_N]
}  

twoway connected z31 z32 || line z32 z32  , lp(dash) || ,legend(off) xla(0(0.2)1) xtitle("") name(myppp, replace)
gen z33 = abs(z31-z32)
qui integ z33 z32, trapezoid
disp "HMI = "r(integral)*2
gen z34 = (z31-z32)^2
qui integ z34 z32, trapezoid
disp "HWMI = "r(integral)*2

Again, thank you.

Kind regards,

Steffen

Announcement