
  • Distribution Test

    Hi


    I am using First Information Report (FIR) data for my analysis. Certain crimes, such as theft, often lack recorded accused names, so I have excluded these cases from my analysis. To show that this exclusion does not introduce bias, I need to demonstrate that the distribution of the dropped cases is not significantly different from the distribution of the remaining dataset, where accused names are available. Is there any statistical test available other than the Kolmogorov-Smirnov test, or any other way to show this?


    Thanks

  • #2
    Niyaj:
    you may want to take a look at the community-contributed module -mcartest-.
    Type -search mcartest- to find it.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      mcartest is designed for this.

      I've thrown y in here, but you may want to exclude it.

      Code:
      clear all
      set obs 1000
      
      * simulate covariates and an outcome
      g x1 = rgamma(5,1)
      g x2 = runiform()
      g x3 = rgamma(3,2)
      g y = x1 + x2 + x3 + rnormal()
      
      * scenario 1: x3 is missing completely at random (MCAR)
      replace x3 = . if runiform()>0.9
      g missing = mi(x3)
      summ missing
      
      mcartest x1 x2 x3 y
      logit missing x1 x2 y
      covbal missing x1 x2 y
      
      
      clear all
      set obs 1000
      
      g x1 = rgamma(5,1)
      g x2 = runiform()
      g x3 = rgamma(3,2)
      g y = x1 + x2 + x3 + rnormal()
      
      * scenario 2: missingness in x3 depends on x1 (MAR, not MCAR)
      replace x3 = . if runiform()>0.5 & x1>7
      g missing = mi(x3)
      summ missing
      
      mcartest x1 x2 x3 y
      logit missing x1 x2 y
      covbal missing x1 x2 y
      Last edited by George Ford; 01 Feb 2025, 09:56.



      • #4
        A quantile-quantile plot is often a good way to compare two distributions. One way into the literature is through https://journals.sagepub.com/doi/pdf...6867X241276114
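
        The Q-Q plot approach might be sketched in Stata as follows; the variable names (amount for the quantity being compared, dropped for the excluded cases) are illustrative, not from the OP's data:

        Code:
        * split the variable of interest by missingness status,
        * then plot the quantiles of one group against the other
        gen amount_kept    = amount if dropped == 0
        gen amount_dropped = amount if dropped == 1
        qqplot amount_kept amount_dropped

        If the points lie close to the 45-degree reference line, the two distributions are similar.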



        • #5
          [QUOTE=George Ford;n1771939]
          I've thrown y in here, but you may want to exclude it.
          [/QUOTE]

          Absolutely not: the y must remain in. Missing values will bias the model only if the chance of missingness depends on y. That chance can depend on any or all of the xs; as long as it is independent of y, the model will be unbiased. So the relationship between y and missingness is the only thing you care about.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------





            • #7
              The OP does not say anything about sample size, but if it is large, the use of -mcartest- (or related techniques) will be misleading because all p-values will be very low.

              Also, I agree strongly with Maarten Buis; for a recent piece on this, see McGowan, LD'A, et al. (2024), "The “Why” behind including “Y” in your imputation model", Statistical Methods in Medical Research, 33(6): 996-1020.



              • #8
                If you're essentially doing a regression analysis -- even something like Poisson regression -- you do not need the entire distribution to be the same. I'm going to call y the variable that you always observe but whose identity you do not always know; I assume this means you can't merge with other data. Because y is a count (that presumably has lots of zeros), I would probably use Poisson regression and include a dummy variable for whether you observe the label. Then a robust t test on that dummy tells you whether the means differ; if they do, that would be an issue. But if the goal is regression-type analysis, you don't need to know whether variances or other features of the distribution are the same.
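
                As I read the suggestion above, it might be sketched like this; the variable names (y for the count, named for whether the accused name is recorded) are illustrative:

                Code:
                * named = 1 if the accused name is recorded, 0 otherwise
                gen byte named = !missing(accused_name)
                * Poisson regression with a dummy for observing the label
                poisson y i.named x1 x2, vce(robust)
                * the robust z-test on 1.named checks whether the means differ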



                • #9
                  I will extend a bit on my previous comment (#5) and show why that statement is true. When we are doing a regression, we are interested in some function (typically the mean) of the distribution of the dependent/explained/left-hand-side/endogenous variable \(y\) given the independent/explanatory/right-hand-side/exogenous variables \(x\): \(f(y | x)\). However, when we have missing values and we ignore all observations with missing values, we use the distribution of \(y\) given \(x\) and being fully observed. Let's add a variable \(m\) which is 1 if any variable is missing and 0 if an observation is fully observed. So we use \(f(y|x, m=0)\) instead of \(f(y|x)\). Using Bayes' theorem, we can write the model we estimate as:

                  \(
                  f(y|x, m=0) = \frac{f(y,x,m=0)}{f(x,m=0)}
                  \)

                  \(
                  = \frac{Pr(m=0|y,x) f(y|x) f(x) }{Pr(m=0|x) f(x) }
                  \)

                  If the probability of missingness depends on \(x\) but not on \(y\), we can rewrite \(Pr(m=0|y,x)\) as \(Pr(m=0|x)\). So we have:

                  \(
                  = \frac{Pr(m=0|x) f(y|x) f(x) }{Pr(m=0|x) f(x) } = f(y|x)
                  \)

                  So \(f(y|x, m=0) = f(y|x)\) as long as the probability of missingness is independent of \(y\), and a model estimated on only the observed observations will be unbiased.

                  I often find it helpful to also run a simulation to get a feel for what is going on. Here I create data such that the regression model in the population has parameters \(\beta_1=3\) and \(\beta_2=1\). The chance of being missing depends on x1 and x2 but not on y, so we expect the estimates to be unbiased.

                  Code:
                  . clear all
                  
                  . set seed 123456
                  
                  .
                  . program define sim
                    1.         drop _all
                    2.         set obs 1000
                    3.         gen x1 = rnormal()
                    4.         gen x2 = _n < 501
                    5.         gen y = 1 + 3*x1 + 1*x2 + rnormal(0,4)
                    6.
                  .         // create missing values independent of y
                  .         gen p = invlogit(`=ln(.4)' + `=ln(1.2)'*x1 + `=ln(2)'*x2)
                    7.         replace x2 = . if runiform() < p
                    8.
                  .         // estimate the regression
                  .         reg y x1 x2
                    9. end
                  
                  .
                  . simulate b1=_b[x1] b2=_b[x2] , reps(10000) : sim
                  
                        Command: sim
                             b1: _b[x1]
                             b2: _b[x2]
                  
                  Simulations (10,000): .........10.........20.........30.........40.........50.........60.........70.........80.........90.........100.........110........
                  > .120.........130.........140.........150.........160.........170.........180.........190.........200.........210.........220.........230.........240...
                  [snip]
                  > ........9,850.........9,860.........9,870.........9,880.........9,890.........9,900.........9,910.........9,920.........9,930.........9,940.........9,9
                  > 50.........9,960.........9,970.........9,980.........9,990.........10,000 done
                  
                  . sum b*
                  
                      Variable |        Obs        Mean    Std. dev.       Min        Max
                  -------------+---------------------------------------------------------
                            b1 |     10,000    2.999831    .1584125   2.403287    3.57203
                            b2 |     10,000    .9992695    .3237411  -.3061394   2.217188
                  Last edited by Maarten Buis; 03 Feb 2025, 03:08.
                  ---------------------------------
                  Maarten L. Buis
                  University of Konstanz
                  Department of history and sociology
                  box 40
                  78457 Konstanz
                  Germany
                  http://www.maartenbuis.nl
                  ---------------------------------



                  • #10
                    Jeff Wooldridge: I find your post above (#8) confusing. As I read the OP, this is a missing data problem: the OP wants to do a complete case analysis and thus would like to think of this as an MCAR situation. However, it might be MAR or even MNAR, in which case a complete case analysis is likely to be biased. I think you may be reading the original post differently, but I would appreciate it if you could explain how you are interpreting it.



                    • #11
                      I think what Jeff Wooldridge is saying is:

                      Code:
                      poisson Y X1 X2 X3 dummy_missing, robust

                      The coefficient on dummy_missing is your test.



                      • #12
                        If Jeff Wooldridge is saying that, then I completely missed it and still don't see it; further, I think that is wrong. A test of this kind can be done via regression as follows: make an indicator variable that tells you whether the observation has missing data; that indicator becomes the outcome in a logistic regression with the other variables as predictors. If any of the predictors are associated with the outcome, then the missingness is not MCAR.
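
                        The procedure described above might be sketched in Stata as follows; the variable names are illustrative:

                        Code:
                        * indicator for whether the observation has missing data
                        gen byte anymiss = missing(accused_name)
                        * logistic regression of missingness on the other variables
                        logit anymiss x1 x2 x3
                        * joint Wald test: any association implies the data are not MCAR
                        test x1 x2 x3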



                        • #13
                          Y here is the variable with missing values, not the y outcome.
