
  • Distribution Test

    Hi


    I am using First Information Report (FIR) data for my analysis. Certain crimes, such as theft, often lack recorded accused names, so I have excluded these cases from my analysis. To show that this exclusion does not introduce bias, I need to demonstrate that the distribution of the dropped cases is not significantly different from the distribution of the remaining dataset, where accused names are available. Is there any statistical test available other than the Kolmogorov-Smirnov test, or any other way to show this?


    Thanks

  • #2
    Niyaj:
    you may want to take a look at the community-contributed module -mcartest-.
    Type -search mcartest- to find it.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      mcartest is designed for this.

      I've thrown y in here, but you may want to exclude it.

      Code:
      clear all
      set obs 1000
      
      * simulate covariates and an outcome
      g x1 = rgamma(5,1)
      g x2 = runiform()
      g x3 = rgamma(3,2)
      g y = x1 + x2 + x3 + rnormal()
      
      * scenario 1: x3 is missing completely at random (MCAR)
      replace x3 = . if runiform()>0.9
      g missing = mi(x3)
      summ missing
      
      mcartest x1 x2 x3 y
      logit missing x1 x2 y
      covbal missing x1 x2 y
      
      
      clear all
      set obs 1000
      
      g x1 = rgamma(5,1)
      g x2 = runiform()
      g x3 = rgamma(3,2)
      g y = x1 + x2 + x3 + rnormal()
      
      * scenario 2: missingness in x3 depends on x1 (MAR, not MCAR)
      replace x3 = . if runiform()>0.5 & x1>7
      g missing = mi(x3)
      summ missing
      
      mcartest x1 x2 x3 y
      logit missing x1 x2 y
      covbal missing x1 x2 y
      Last edited by George Ford; 01 Feb 2025, 09:56.



      • #4
        A quantile-quantile plot is often a good way to compare two distributions. One way into the literature is through https://journals.sagepub.com/doi/pdf...6867X241276114
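
        The Q-Q plot approach might be sketched in Stata as follows; the variable names (amount for the quantity being compared, dropped for the excluded cases) are illustrative, not from the OP's data:

        Code:
        * split the variable of interest by missingness status,
        * then plot the quantiles of one group against the other
        gen amount_kept    = amount if dropped == 0
        gen amount_dropped = amount if dropped == 1
        qqplot amount_kept amount_dropped

        If the points lie close to the 45-degree reference line, the two distributions are similar.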



        • #5
          [QUOTE=George Ford;n1771939]
          I've thrown y in here, but you may want to exclude it.
          [/QUOTE]

          Absolutely not: the y must remain in. Missing values will bias the model only if the chance of missingness depends on y. That chance can depend on any or all of the xs; as long as it is independent of y, the model will be unbiased. So the relationship between y and missingness is the only thing you care about.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------





            • #7
              The OP does not say anything about sample size, but if it is large, the use of -mcartest- (or related techniques) will be misleading because all p-values will be very low.

              Also, I agree strongly with Maarten Buis; for a recent piece on this, see McGowan, LD'A, et al. (2024), "The “Why” behind including “Y” in your imputation model", Statistical Methods in Medical Research, 33(6): 996-1020.



              • #8
                If you're essentially doing a regression analysis -- even something like Poisson regression -- you do not need the entire distribution to be the same. I'm going to call y the variable that you always observe but whose identity you do not always know; I assume this means you can't merge with other data. Because y is a count (that presumably has lots of zeros), I would probably use Poisson regression and include a dummy variable for whether you observe the label. Then a robust t test on that dummy tells you whether the means differ; if they do, that would be an issue. But if the goal is regression-type analysis, you don't need to know whether variances or other features of the distribution are the same.
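
                As I read the suggestion above, it might be sketched like this; the variable names (y for the count, named for whether the accused name is recorded) are illustrative:

                Code:
                * named = 1 if the accused name is recorded, 0 otherwise
                gen byte named = !missing(accused_name)
                * Poisson regression with a dummy for observing the label
                poisson y i.named x1 x2, vce(robust)
                * the robust z-test on 1.named checks whether the means differ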



                • #9
                  I will extend a bit on my previous comment (#5) and show why that statement is true. When we are doing a regression, we are interested in some function (typically the mean) of the distribution of the dependent/explained/left-hand-side/endogenous variable \(y\) given the independent/explanatory/right-hand-side/exogenous variables \(x\): \(f(y | x)\). However, when we have missing values and we ignore all observations with missing values, we use the distribution of \(y\) given \(x\) and being fully observed. Let's add a variable \(m\) which is 1 if any variable is missing and 0 if an observation is fully observed. So we use \(f(y|x, m=0)\) instead of \(f(y|x)\). Using Bayes' theorem, we can write the model we estimate as:

                  \(
                  f(y|x, m=0) = \frac{f(y,x,m=0)}{f(x,m=0)}
                  \)

                  \(
                  = \frac{Pr(m=0|y,x) f(y|x) f(x) }{Pr(m=0|x) f(x) }
                  \)

                  If the probability of missingness depends on \(x\) but not on \(y\), we can rewrite \(Pr(m=0|y,x)\) as \(Pr(m=0|x)\). So we have:

                  \(
                  = \frac{Pr(m=0|x) f(y|x) f(x) }{Pr(m=0|x) f(x) } = f(y|x)
                  \)

                  So \(f(y|x, m=0) = f(y|x)\) as long as the probability of missingness is independent of \(y\), and a model estimated on only the observed observations will be unbiased.

                  I often find it helpful to also run a simulation to get a feel for what is going on. Here I create data such that the regression model in the population has parameters \(\beta_1=3\) and \(\beta_2=1\). The chance of being missing depends on x1 and x2 but not on y, so we expect the estimates to be unbiased.

                  Code:
                  . clear all
                  
                  . set seed 123456
                  
                  .
                  . program define sim
                    1.         drop _all
                    2.         set obs 1000
                    3.         gen x1 = rnormal()
                    4.         gen x2 = _n < 501
                    5.         gen y = 1 + 3*x1 + 1*x2 + rnormal(0,4)
                    6.
                  .         // create missing values independent of y
                  .         gen p = invlogit(`=ln(.4)' + `=ln(1.2)'*x1 + `=ln(2)'*x2)
                    7.         replace x2 = . if runiform() < p
                    8.
                  .         // estimate the regression
                  .         reg y x1 x2
                    9. end
                  
                  .
                  . simulate b1=_b[x1] b2=_b[x2] , reps(10000) : sim
                  
                        Command: sim
                             b1: _b[x1]
                             b2: _b[x2]
                  
                  Simulations (10,000): .........10.........20.........30.........40.........50.........60.........70.........80.........90.........100.........110........
                  > .120.........130.........140.........150.........160.........170.........180.........190.........200.........210.........220.........230.........240...
                  [snip]
                  > ........9,850.........9,860.........9,870.........9,880.........9,890.........9,900.........9,910.........9,920.........9,930.........9,940.........9,9
                  > 50.........9,960.........9,970.........9,980.........9,990.........10,000 done
                  
                  . sum b*
                  
                      Variable |        Obs        Mean    Std. dev.       Min        Max
                  -------------+---------------------------------------------------------
                            b1 |     10,000    2.999831    .1584125   2.403287    3.57203
                            b2 |     10,000    .9992695    .3237411  -.3061394   2.217188
                  Last edited by Maarten Buis; 03 Feb 2025, 03:08.
                  ---------------------------------
                  Maarten L. Buis
                  University of Konstanz
                  Department of history and sociology
                  box 40
                  78457 Konstanz
                  Germany
                  http://www.maartenbuis.nl
                  ---------------------------------



                  • #10
                    Jeff Wooldridge: I find your post above (#8) confusing. As I read the OP, this is a missing data problem: the OP wants to do a complete case analysis and thus would like to think of this as an MCAR situation. However, it might be MAR or even MNAR, in which case a complete case analysis is likely to be biased. I think you may be reading the original post differently, but I would appreciate it if you could explain how you are interpreting it.



                    • #11
                      I think what Jeff Wooldridge is saying is:

                      Code:
                      poisson Y X1 X2 X3 dummy_missing, robust

                      The coefficient on dummy_missing is your test.



                      • #12
                        If Jeff Wooldridge is saying that, then I completely missed it and still don't see it; further, I think that is wrong. A test of this kind can be done via regression as follows: make an indicator variable that tells you whether the observation has missing data; that indicator becomes the outcome in a logistic regression with the other variables as predictors. If any of the predictors are associated with the outcome, then the missingness is not MCAR.
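
                        The procedure described above might be sketched in Stata as follows; the variable names are illustrative:

                        Code:
                        * indicator for whether the observation has missing data
                        gen byte anymiss = missing(accused_name)
                        * logistic regression of missingness on the other variables
                        logit anymiss x1 x2 x3
                        * joint Wald test: any association implies the data are not MCAR
                        test x1 x2 x3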



                        • #13
                          Y here is the variable with missing values, not the y outcome.
