Is there a way in Stata to drop cases if the percentage of missing values on a key variable is greater than a certain value?

Qiguo Lian

Join Date: May 2014

Posts: 27
#1

09 Jan 2019, 02:54

Hi all! I want to drop cases from groups in which the percentage of missing values on a key variable is greater than a certain value.

For example, in Belgium and some other countries (variable cid), more than 50% of the values of the key variable eat are missing. I do not want to include those cases in my logit model. How can I do this?

I installed the missings command developed by Nicholas J. Cox (https://www.stata-journal.com/articl...article=dm0085) but could not figure it out; perhaps my question is beyond the scope of the missings command.

Can anyone give me some advice?
Tags: missing value
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#2

09 Jan 2019, 03:49

Code:

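* 1 if eat is missing, 0 otherwise
bysort cid : gen mis = missing(eat)
* running count of missing values within each country
by cid : replace mis = sum(mis)
* percentage missing within each country
by cid : replace mis = mis[_N]/_N*100
* flag and drop countries with more than 50% missing
gen notuse = mis > 50
drop if notuse == 1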

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
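The same idea can be written more compactly with egen. The following is only a minimal sketch on made-up toy data: the values are invented for illustration, the variable name pctmis is an assumption, and cid and eat simply mirror the names used in the question.

```stata
* Toy data (invented): cid is the country identifier, eat the key variable
clear
input cid eat
1 1
1 .
1 .
1 .
2 1
2 0
2 .
2 1
3 0
3 .
end

* Percentage of missing values of eat within each country
bysort cid: egen pctmis = mean(100 * missing(eat))

* Drop countries where more than 50% of eat is missing
drop if pctmis > 50
```

Here country 1 (75% missing) is dropped, while countries 2 (25%) and 3 (exactly 50%) are kept, because the condition is strictly greater than 50. If you would rather keep the data intact, you could skip the drop and add a qualifier such as if pctmis <= 50 to the model command instead.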

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

09 Jan 2019, 05:11

Maarten gave excellent code.

Since you said you could not figure out how to do this with the user-written missings command (SJ, Nick Cox), here is a toy example:

Code:

. webuse nlswork
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. * Let's say we wish to drop the variables with more than 20% missing values within any race group

. by race, sort: missings report, percent

----------------------------------------------------------------------------------------------------------
-> race = white

Checking missings in all variables:
10576 observations with missing values

----------------------------
          |      #        %
----------+-----------------
      age |     15     0.07
      msp |      6     0.03
  nev_mar |      6     0.03
    grade |      1     0.00
 not_smsa |      6     0.03
   c_city |      6     0.03
    south |      6     0.03
 ind_code |    257     1.27
 occ_code |     81     0.40
    union |   6586    32.64
   wks_ue |   3924    19.44
   tenure |    321     1.59
    hours |     47     0.23
 wks_work |    478     2.37
----------------------------

----------------------------------------------------------------------------------------------------------
-> race = black

Checking missings in all variables:
4342 observations with missing values

----------------------------
          |      #        %
----------+-----------------
      age |      9     0.11
      msp |     10     0.12
  nev_mar |     10     0.12
    grade |      1     0.01
 not_smsa |      2     0.02
   c_city |      2     0.02
    south |      2     0.02
 ind_code |     80     0.99
 occ_code |     38     0.47
    union |   2618    32.52
   wks_ue |   1710    21.24
   tenure |    110     1.37
    hours |     17     0.21
 wks_work |    219     2.72
----------------------------

----------------------------------------------------------------------------------------------------------
-> race = other

Checking missings in all variables:
164 observations with missing values

--------------------------
          |    #        %
----------+---------------
 ind_code |    4     1.32
 occ_code |    2     0.66
    union |   92    30.36
   wks_ue |   70    23.10
   tenure |    2     0.66
    hours |    3     0.99
 wks_work |    6     1.98
--------------------------

. drop union wks_ue

This way, with only two lines, you can see which variables exceed the missing-value threshold (if there are many variables, this streamlines the inspection), and the final decision to delete them remains in your hands.

Hopefully that helps.
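If reading the tables by eye becomes tedious, the check can also be scripted with official commands only. This is a rough sketch rather than a feature of missings: the 20% threshold, the three variables looped over, and the pct_ prefix for the new variables are all choices made here for illustration.

```stata
* Same example dataset as above (requires an internet connection)
webuse nlswork, clear

* For each variable, compute the percentage missing within each race group
* and report the variable if any group exceeds 20%
foreach v of varlist union wks_ue tenure {
    quietly bysort race: egen pct_`v' = mean(100 * missing(`v'))
    quietly summarize pct_`v'
    if r(max) > 20 {
        display "`v': more than 20% missing in at least one race group"
    }
}
```

Whether to drop the flagged variables is still a judgment call; the loop only reports them.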

Last edited by Marcos Almeida; 09 Jan 2019, 05:20.

Best regards,

Marcos


Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

09 Jan 2019, 06:42

In missings from the Stata Journal, the absence of any options to drop observations and/or variables if some but not all values are missing is deliberate. The help explains:

Creating entirely empty observations (rows) and variables (columns) is a habit of many spreadsheet users,
but neither is helpful in Stata datasets. The subcommands dropobs and dropvars should help users clean up.
Conversely, there is no explicit support here for dropping observations or variables with some missing and
some nonmissing values. Users so minded will find other subcommands of use as an intermediate step, but
multiple imputation might be a better way forward.
Qiguo Lian

Join Date: May 2014

Posts: 27
#5

09 Jan 2019, 22:56

Thanks to Maarten Buis, Marcos Almeida and Nick Cox for your kind help.