  • Deleting observations with missing values

    Dear all,

    For my thesis I have to research the causes of layoffs at (big) Belgian firms. My data are an unbalanced panel covering 2011-2020. In the first part of my results section I present summary statistics and bivariate analyses (t-test/ranksum test).
    The screenshot shows a table of the summary statistics of my variables. Sorry that it is in Dutch, but my problem should be clear from the information underneath the screenshot.

    [Screenshot: beschrstat.png]
    (Variables; #observations; mean; median, min; max)


    The number of observations varies too much across my variables: for example, 185 147 for Ln(Age) but only 106 419 for my 2 independent variables about productivity. Apparently my regression will use at most those 106 419 observations, so the roughly 80 000 extra observations could skew my sample significantly.
    My thesis supervisor (promotor) advised me to delete observations with a missing value.

    After reading the following article ( https://www.stata.com/support/faqs/d...issing-values/ ), I still do not understand the proposed solution. The article talks about missing values at the beginning and end, and I do not see how to apply it to my case.
    Originally I thought the following code (and doing this for all my variables) would be OK, but apparently it is not:
    Code:
     
     drop if DalingProductiviteit2J >= .
    I understand that this is rather a dumb question, but apparently complex enough for me. I hope that the problem is clearly described. If not, please let me know.

    Thanks in advance,
    Jordi


  • #2
    perhaps,
    Code:
    egen unwanted = rowmiss(_all)
    drop if unwanted
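
    Note that rowmiss(_all) counts missing values across every variable in memory, including ones that never enter the analysis. A minimal sketch of restricting the count to the analysis variables only, using a toy dataset with hypothetical names (id, year, y, x1, x2 are not from the original post):

    ```stata
    * Toy data with hypothetical variables; some values are missing
    clear
    input id year y x1 x2
    1 2011 0 . 5
    1 2012 1 2 .
    1 2013 0 3 6
    2 2011 1 4 7
    2 2012 . 5 8
    end

    * Count missing values per observation, but only across the listed variables
    egen nmiss = rowmiss(y x1 x2)
    tabulate nmiss

    * Keep only observations that are complete on those variables
    drop if nmiss > 0
    count
    ```

    In this toy example only the two complete observations remain. Running tabulate before the drop shows how many observations would be lost.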



    • #3
      Thank you!

      Right before I saw your comment I used the following command:
      Code:
      gen dummyMISSINGPROD = 0
      replace dummyMISSINGPROD = 1 if !missing( DalingProductiviteit2J)
      drop if dummyMISSINGPROD == 0
      Just to be sure: if I drop all the observations with missing values like this right before I analyse my data (summary statistics, t-test/Wilcoxon ranksum test, ...), will Stata then only take into account those observations that have a value for DalingProductiviteit2J? And will my regression (xtlogit) then use the same number of observations as the Wilcoxon ranksum test etc.?

      EDIT:
      My summary statistics look like the following after using the commands above:
      Variable           Obs        Mean   Std. Dev.         Min        Max
      Collectief~t    92,838    .1037075    .3048824           0          1
      Productivi~5    93,552    781421.9     1112061    51460.25    4587718
      DalingPro~1J    93,552    .4275483    .4947255           0          1
      DalingPro~2J    93,552    .1803489    .3844798           0          1
      onder_medi~d    93,552    .4900911    .4999045           0          1
      ROA_w1          93,547    .0850092     .137789   -.8055618   .5625707
      DalingROA1J     93,547    .5168097      .49972           0          1
      DalingROA2J     93,546    .2380006    .4258617           0          1
      onder_medi~A    93,547    .4397896    .4963641           0          1
      GUO             93,539    .3199307    .4664519           0          1
      DUO             93,539    .3215343    .4670678           0          1
      Groepsbedr~f    93,552    .6413759    .4795991           0          1
      StandAlone      93,539    .3585349    .4795728           0          1
      Lnleeftijd~5    93,552    3.144766    .6948031    1.098612   4.143135
      Lngrootte_w5    93,552    16.30382    1.455338    12.64215   19.04326
      Schuldgraa~5    93,549    .5829031    .2734962    .0288165   1.101723
      MVAratio_w5     88,454    .1833262    .2160198    .0009628   .8730854
      Is it okay to proceed with the bivariate and multivariate analysis like this? Or should I drop every observation that has a missing value in any variable, so that eventually all the variables have exactly the same number of observations?
      Last edited by Jordi Imbrechts; 14 May 2022, 08:14.



      • #4
        Jordi:
        not quite.
        Stata will omit every observation that has a missing value in at least one of the variables included in the estimation command.
        In addition, dropping all the observations with missing values yourself in order to perform a complete-case analysis, without diagnosing the mechanism underlying their missingness, is a (very) risky methodological approach, as you may (easily) end up with a biased subsample that has only a tenuous relationship with the original one.
        Kind regards,
        Carlo
        (Stata 19.0)
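
        One way to get summary statistics on exactly the observations a model uses, without dropping anything, is to flag the estimation sample via e(sample). A minimal sketch follows; it uses the auto dataset that ships with Stata and regress rather than the poster's data and xtlogit, on the assumption that the same pattern carries over after xtlogit.

        ```stata
        * Example data shipped with Stata; rep78 is missing for some cars
        sysuse auto, clear

        * Any estimation command silently uses only complete cases on its variables
        regress price mpg rep78

        * e(sample) equals 1 for observations used in the last estimation
        gen byte insample = e(sample)

        * Descriptive statistics restricted to the estimation sample
        summarize price mpg rep78 if insample
        count if insample
        ```

        Because the flag is a variable rather than a drop, the full dataset stays available for other analyses and for checking how the excluded observations differ.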



        • #5
          Carlo,

          I used the following code instead:
          Code:
          egen unwanted = rowmiss(_all)
          drop if unwanted
          As a result, this is what my summary statistics look like:
          [Screenshot: sumstatistic.png]


          DalingPROD1J & DalingROA1J are both dummy variables that have a missing value for the first observation (hence the difference in #obs)
          DalingPROD2J & DalingROA2J are both dummy variables that have a missing value for the first 2 observations

          Is this also a (very) risky methodological approach?

          Kind regards,
          Jordi
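
          Missing values in the first one or two observations per firm are what one would expect if the dummies are built from lagged values. The sketch below is an assumption about how such variables might be constructed, not the poster's actual code; id, year, prod, daling1 and daling2 are hypothetical names.

          ```stata
          * Toy panel: two firms, three years each
          clear
          input id year prod
          1 2011 100
          1 2012 90
          1 2013 95
          2 2011 50
          2 2012 40
          2 2013 30
          end
          xtset id year

          * 1 if productivity fell relative to the previous year.
          * The if qualifier leaves the first year of each firm missing,
          * because there is no previous year to compare with.
          gen byte daling1 = (prod < L.prod) if !missing(L.prod)

          * 1 if productivity fell in each of the last two years.
          * This needs two lags, so the first two years of each firm are missing.
          gen byte daling2 = (prod < L.prod & L.prod < L2.prod) if !missing(L2.prod)

          list, sepby(id)
          ```

          Missingness of this kind is structural (it follows from the definition of the variable), which is a different situation from a firm simply not reporting a value.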



          • #6
            Jordi:
            the risk is related to the reason why those data are missing.
            If you're sure that they are missing completely at random (see the -mi- glossary), your resulting dataset will be a random subsample of your original one.
            As far as I can read your screenshot (BTW: as per the FAQ, screenshots are not the recommended way to share your Stata code/results), you have a theoretical sample of 115,381 observations that is expected to lose >20,000 observations (83,390) due to missing values.
            Obviously, the above holds only if all the reported variables are used in your panel data regression.
            Last edited by Carlo Lazzaro; 14 May 2022, 12:18.
            Kind regards,
            Carlo
            (Stata 19.0)
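
            A first look at the missingness mechanism can be taken with misstable and a missingness indicator. The sketch below uses the auto dataset that ships with Stata, not the poster's data; miss_rep78 is a hypothetical name. Such checks can reveal evidence against missing completely at random, but they cannot prove it.

            ```stata
            sysuse auto, clear

            * Which variables have missing values, and how many
            misstable summarize

            * Which combinations of variables are missing together
            misstable patterns

            * Indicator: 1 if rep78 is missing, 0 otherwise
            gen byte miss_rep78 = missing(rep78)

            * Do observations with and without rep78 differ on observed characteristics?
            ttest price, by(miss_rep78)
            tabulate miss_rep78 foreign
            ```

            If observed variables differ clearly between the two groups, the complete-case subsample is unlikely to be a random subsample of the original data.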



            • #7
              Carlo,

              I do not completely understand your comment. My bad.
              I exported all my data from the Bel-first database and generated some new variables in Stata. Before I used the "drop if unwanted" command, the missing values arose purely because, for example, one firm does not have any data for a given variable, while some other firms do not have any data for another variable. So yes, they were random.
              By using "drop if unwanted", my unbalanced panel changed into an unbalanced panel with gaps.

              Is this methodology correct?

              Kind regards, Jordi
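
              The gaps created by dropping observations can be inspected with xtdescribe. A minimal sketch on a toy panel with hypothetical names (id, year, x, lag_x):

              ```stata
              * Toy panel: firm 1 has a missing value in the middle year
              clear
              input id year x
              1 2011 1
              1 2012 .
              1 2013 3
              2 2011 4
              2 2012 5
              2 2013 6
              end
              xtset id year

              * Dropping the incomplete observation leaves a hole in firm 1's series
              drop if missing(x)

              * Participation pattern per firm; a dot marks a year with no observation
              xtdescribe

              * Lags do not bridge a gap: L.x for firm 1 in 2013 refers to 2012 and is missing
              gen lag_x = L.x
              list, sepby(id)
              ```

              Any variable built from lags or differences therefore loses further observations after each gap, which is worth keeping in mind when comparing observation counts.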



              • #8
                Is there any way to delete a post? I realize this was a relatively dumb question.



                • #9
                  Jordi:
                  you can delete a post within 1 hour of posting it.
                  That said, your question was not dumb at all if you felt the need to post it.
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Carlo,
                    That is true. Thanks anyway!
