
  • Is there a way in Stata to drop cases if the percentage of missing values on a key variable is greater than a certain value?

    [Attached image: dropcases.png]

    Hi all! I want to drop cases if the percentage of missing values on a key variable is greater than a certain value.

    For example, in Belgium and other countries (variable cid), the percentage of missing values on the key variable eat is greater than 50%. I do not want to include those cases in a logit model. How can I do this?

    I installed the missings command developed by Nicholas J. Cox (https://www.stata-journal.com/articl...article=dm0085) but could not figure it out; perhaps my question is beyond the missings command.

    Can anyone give me some advice?

  • #2
    Code:
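    * 1 if eat is missing, 0 otherwise
    bysort cid : gen mis = missing(eat)
    * running count of missing values within each country
    by     cid : replace mis = sum(mis)
    * the last running count is the country total; convert it to a percentage
    by     cid : replace mis = mis[_N]/_N*100
    * flag and drop countries with more than 50% missing on eat
    gen notuse = mis > 50
    drop if notuse == 1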
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------
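
A possible variation on the code above, for readers who would rather keep the data and only restrict the estimation sample: the sketch below computes the same per-country percentage with egen and builds an indicator that can go into an if qualifier. The toy data and the names pct_mis_eat, keep_country, y, x1 and x2 are assumptions made for illustration; only cid and eat come from the question.

```stata
* Toy data: country 1 has 2 of 3 values of eat missing, country 2 has 1 of 3
clear
input cid eat
1 1
1 .
1 .
2 0
2 1
2 .
end

* Percentage of observations with eat missing, within each country
bysort cid: egen pct_mis_eat = mean(100 * missing(eat))

* 1 for countries at or below the 50% threshold, 0 otherwise
generate byte keep_country = pct_mis_eat <= 50

list, sepby(cid)

* With real data, the indicator can restrict the model instead of dropping rows
* (y, x1 and x2 are hypothetical variable names):
* logit y x1 x2 eat if keep_country
```

Nothing is deleted this way, so the threshold can be changed later without reloading the dataset.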



    • #3
      Maarten gave excellent code.

      Since you said you failed to figure out how to do it with the user-written missings (SJ, Nick Cox), this is a toy example:

      Code:
      . webuse nlswork
      (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
      
      . * Let's say we wish to drop the variables with more than 20% missing values, by race
      
      . by race, sort: missings report, percent
      
      ----------------------------------------------------------------------------------------------------------
      -> race = white
      
      Checking missings in all variables:
      10576 observations with missing values
      
      ----------------------------
                |      #        %
      ----------+-----------------
            age |     15     0.07
            msp |      6     0.03
        nev_mar |      6     0.03
          grade |      1     0.00
       not_smsa |      6     0.03
         c_city |      6     0.03
          south |      6     0.03
       ind_code |    257     1.27
       occ_code |     81     0.40
          union |   6586    32.64
         wks_ue |   3924    19.44
         tenure |    321     1.59
          hours |     47     0.23
       wks_work |    478     2.37
      ----------------------------
      
      ----------------------------------------------------------------------------------------------------------
      -> race = black
      
      Checking missings in all variables:
      4342 observations with missing values
      
      ----------------------------
                |      #        %
      ----------+-----------------
            age |      9     0.11
            msp |     10     0.12
        nev_mar |     10     0.12
          grade |      1     0.01
       not_smsa |      2     0.02
         c_city |      2     0.02
          south |      2     0.02
       ind_code |     80     0.99
       occ_code |     38     0.47
          union |   2618    32.52
         wks_ue |   1710    21.24
         tenure |    110     1.37
          hours |     17     0.21
       wks_work |    219     2.72
      ----------------------------
      
      ----------------------------------------------------------------------------------------------------------
      -> race = other
      
      Checking missings in all variables:
      164 observations with missing values
      
      --------------------------
                |    #        %
      ----------+---------------
       ind_code |    4     1.32
       occ_code |    2     0.66
          union |   92    30.36
         wks_ue |   70    23.10
         tenure |    2     0.66
          hours |    3     0.99
       wks_work |    6     1.98
      --------------------------
      
      . drop union wks_ue

      This way, with only two lines, you can check which variables exceed the missing-value threshold (if there were many, this streamlines the inspection), and the final decision to delete the variables stays in your hands.

      Hopefully that helps.
      Last edited by Marcos Almeida; 09 Jan 2019, 05:20.
      Best regards,

      Marcos
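
If there were too many variables to inspect by eye, the check in #3 could be scripted. The following is only a sketch using built-in commands, not a feature of missings: it loops over all variables, computes the percentage missing within each race group, and collects the variables that exceed the threshold in at least one group. The local macro names threshold and todrop are made up for this example.

```stata
webuse nlswork, clear

local threshold 20
local todrop

foreach v of varlist _all {
    * the grouping variable itself is not a candidate
    if "`v'" == "race" continue

    * percentage of missing values of this variable within each race group
    tempvar pct
    quietly bysort race: egen `pct' = mean(100 * missing(`v'))

    * largest group-level percentage for this variable
    quietly summarize `pct', meanonly
    if r(max) > `threshold' local todrop `todrop' `v'

    drop `pct'
}

display "Above `threshold'% missing in at least one race group: `todrop'"
```

Judging by the report in #3, this should flag union and wks_ue. The drop itself is deliberately left as a separate manual step (drop `todrop'), in keeping with the point that the final decision is yours.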



      • #4
        In missings from the Stata Journal, the absence of any options to drop observations and/or variables if some but not all values are missing is deliberate. The help explains:

        Creating entirely empty observations (rows) and variables (columns) is a habit of many spreadsheet users,
        but neither is helpful in Stata datasets. The subcommands dropobs and dropvars should help users clean up.
        Conversely, there is no explicit support here for dropping observations or variables with some missing and
        some nonmissing values. Users so minded will find other subcommands of use as an intermediate step, but
        multiple imputation might be a better way forward.
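
For readers who have not used the two subcommands mentioned in the help, here is a minimal sketch on toy data. It assumes missings is already installed and that dropobs and dropvars accept the force option as I understand it from the help; please check help missings before relying on it.

```stata
* Toy data: observation 2 is entirely missing, variable b is entirely missing
clear
input a b c
1 . .
. . .
3 . 4
end

* Drop observations that are missing on all variables
* (force is needed because the data in memory have changed since last saved)
missings dropobs, force

* Drop variables that are missing in all observations
missings dropvars, force

list
```

Observations and variables that are only partly missing, such as c here, are left alone, which is the deliberate design choice described above.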



        • #5
          Thanks to Maarten Buis, Marcos Almeida and Nick Cox for your kind help.

