Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • -missings- now available from SSC: new program for managing missings

    Thanks to Kit Baum as usual, a new program missings is now available from SSC.

    Use

    Code:
     
    ssc inst missings
    to install.

    Stata version 9 up is required (but see below for a note for any people on
    version 9 itself).

    missings combines various basic utilities for managing variables that
    (may) have missing values. The syntax starts

    missings subcommand ...

    with a choice of six subcommands. By default "missing" means numeric
    missing (that is, sysmiss . or one of the extended missing values .a to
    .z) for numeric variables and empty or "" for string variables.

    If a varlist is not specified, then it is interpreted by default as all variables.

    missings report issues a report on the number of missing values in
    varlist. By default counts of missings are given by variables;
    optionally counts are given by observations.

    missings list lists observations with missing values in varlist.

    missings table tabulates observations by the number of missing values in
    varlist.

    missings tag generates a variable containing the number of missing
    values in each observation on varlist.

    missings dropvars drops any variables in varlist which are missing on
    all values.

    missings dropobs drops any observations which are missing on all values
    in varlist.

    missings is intended to unite and supersede the main ideas of nmissing
    (Cox 1999, 2001a, 2003, 2005) and dropmiss (Cox 2001b, 2008), together
    with a more up-to-date take on what may be useful given other commands.

    Now creating entirely empty observations (rows) and/or variables
    (columns) is a habit of many spreadsheet users, but neither is helpful
    in Stata datasets. The subcommands dropobs and dropvars should help
    users clean up.

    Conversely, there is no explicit support in missings for dropping
    observations or variables with some missing and some non-missing values.
    Users so minded will find other subcommands of use as an intermediate
    step, but multiple imputation might be a better way forward.

    Let's look at a few examples. We can read in a fairly complicated
    dataset and get a concise report on missings. We can specify a threshold
    to see which variables look particularly poor.

    Code:
     
    . webuse nlswork, clear 
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . missings report
    
    Checking missings in all variables:
    15082 observations with missing values
    
    age           24
    msp           16
    nev_mar       16
    grade          2
    not_smsa       8
    c_city         8
    south          8
    ind_code     341
    occ_code     121
    union       9296
    wks_ue      5704
    tenure       433
    hours         67
    wks_work     703
    
    . missings report, min(1000)
    
    Checking missings in all variables:
    15082 observations with missing values
    
    union     9296
    wks_ue    5704
    We can home in on the observations with missing values. Again, a
    threshold can be specified:

    Code:
     
    . missings list, min(5)
    
    Checking missings in all variables:
    15082 observations with missing values
    
           +------------------------------------------------------------+
     6924. | age | msp | nev_mar | not_smsa | c_city | south | ind_code |
           |  26 |   . |       . |        0 |      1 |     0 |       11 |
           |------------------------------------------------------------|
           | occ_code | union | wks_ue |   tenure  | hours  | wks_work  |
           |        8 |     . |      . | .0833333  |    40  |        .  |
           +------------------------------------------------------------+
    
           +------------------------------------------------------------+
    19002. | age | msp | nev_mar | not_smsa | c_city | south | ind_code |
           |  31 |   . |       . |        0 |      1 |     1 |        7 |
           |------------------------------------------------------------|
           | occ_code | union | wks_ue |   tenure  | hours  | wks_work  |
           |        8 |     . |      . | .3333333  |     .  |        .  |
           +------------------------------------------------------------+
    
           +------------------------------------------------------------+
    21220. | age | msp | nev_mar | not_smsa | c_city | south | ind_code |
           |  26 |   1 |       0 |        . |      . |     . |        7 |
           |------------------------------------------------------------|
           | occ_code | union | wks_ue |   tenure  | hours  | wks_work  |
           |        3 |     . |      . | .0833333  |    40  |        .  |
           +------------------------------------------------------------+
    
           +------------------------------------------------------------+
    22493. | age | msp | nev_mar | not_smsa | c_city | south | ind_code |
           |  32 |   1 |       0 |        1 |      0 |     0 |        . |
           |------------------------------------------------------------|
           | occ_code | union | wks_ue |   tenure  | hours  | wks_work  |
           |        . |     . |      . |      .75  |     .  |        .  |
           +------------------------------------------------------------+
    
           +------------------------------------------------------------+
    22628. | age | msp | nev_mar | not_smsa | c_city | south | ind_code |
           |  39 |   0 |       0 |        0 |      1 |     0 |        . |
           |------------------------------------------------------------|
           | occ_code | union | wks_ue |   tenure  | hours  | wks_work  |
           |        3 |     . |      . |      1.5  |     .  |        .  |
           +------------------------------------------------------------+
    Another basic subcommand reports counts of missings across all variables
    specified, or alternatively all variables:

    Code:
     
    . missings table 
    
    Checking missings in all variables:
    15082 observations with missing values
    
           # of |
        missing |
         values |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |     13,452       47.14       47.14
              1 |     13,790       48.33       95.47
              2 |        964        3.38       98.85
              3 |        291        1.02       99.87
              4 |         32        0.11       99.98
              5 |          2        0.01       99.99
              6 |          3        0.01      100.00
    ------------+-----------------------------------
          Total |     28,534      100.00
    Where appropriate, a by: prefix can be specified to get a breakdown:

    Code:
    . bysort race: missings table 
    
    -------------------------------------------------------------------------------
    -> race = white
    
    Checking missings in all variables:
    10576 observations with missing values
    
           # of |
        missing |
         values |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      9,604       47.59       47.59
              1 |      9,672       47.93       95.52
              2 |        677        3.35       98.88
              3 |        199        0.99       99.86
              4 |         25        0.12       99.99
              5 |          1        0.00       99.99
              6 |          2        0.01      100.00
    ------------+-----------------------------------
          Total |     20,180      100.00
    
    -------------------------------------------------------------------------------
    -> race = black
    
    Checking missings in all variables:
    4342 observations with missing values
    
           # of |
        missing |
         values |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      3,709       46.07       46.07
              1 |      3,966       49.26       95.33
              2 |        278        3.45       98.78
              3 |         89        1.11       99.89
              4 |          7        0.09       99.98
              5 |          1        0.01       99.99
              6 |          1        0.01      100.00
    ------------+-----------------------------------
          Total |      8,051      100.00
    
    -------------------------------------------------------------------------------
    -> race = other
    
    Checking missings in all variables:
    164 observations with missing values
    
           # of |
        missing |
         values |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        139       45.87       45.87
              1 |        152       50.17       96.04
              2 |          9        2.97       99.01
              3 |          3        0.99      100.00
    ------------+-----------------------------------
          Total |        303      100.00
    It's sometimes helpful to have a count of missings in a variable. Many
    people will know that there is already an egen function for this, but oh
    well.

    Code:
    . missings tag, generate(nmissing)
    
    Checking missings in all variables:
    15082 observations with missing values
    This dataset does not have any variables or observations that are
    entirely missing, but let's see how missings could be used to remove
    such.

    Code:
    . gen newt = "" 
    (28,534 missing values generated)
    
    . gen frog = . 
    (28,534 missing values generated)
    
    . gen toad = .a 
    (28,534 missing values generated)
    
    . missings dropvars newt frog toad, force sysmiss
    
    Checking missings in newt frog toad:
    28534 observations with system missing values
    
    note: newt frog dropped
    
    . missings dropvars toad, force sysmiss
    
    Checking missings in toad:
    0 observations with system missing values
    
    note: no variables qualify
    
    . missings dropvars toad, force 
    
    Checking missings in toad:
    28534 observations with missing values
    
    note: toad dropped
    
    . set obs 30000 
    number of observations (_N) was 28,534, now 30,000
    
    . missings dropobs, force 
    
    Checking missings in idcode year birth_yr age race msp nev_mar grade collgrad
        not_smsa c_city south ind_code occ_code union wks_ue ttl_exp tenure hours
        wks_work ln_wage nmissing:
    16548 observations with missing values
    
    (1,466 observations deleted)
    There are various safety features here. Most obviously, if the dataset
    in memory has not been saved, then you must show awareness that you are
    exerting brute force. In addition, extended missing values .a to .z are
    supposed to be informative, so you can specify sysmiss to spell out that
    only system missings can be dropped.


    Cox, N.J. 1999. Numbers of missing and present values. Stata Technical
    Bulletin
    49: 7-8.

    2001a. Enhancements to numbers of missing and present values. Stata Technical
    Bulletin
    60: 2-3.

    2001b. Dropping variables or observations with missing values. Stata Technical
    Bulletin
    60: 7-8.

    2003. Numbers of present and missing values. Stata Journal 3: 449.

    2005. Numbers of present and missing values. Stata Journal 5: 607.

    2008. Dropping variables or observations with missing values. Stata Journal 8: 594.

    (So what about users of version 9? The code works fine in those
    versions. The problem is just that you need to rename the help file from
    .sthlp to .hlp.)



  • #2
    Thanks again to Kit Baum, missings has been updated on SSC. The update adds options to show percents as well as counts for missing values in missings report. This is allowed both for counting missings by variables and for counting by observations.

    To download, use ssc install missings or ssc install missings, replace as appropriate.

    Code:
    . webuse nlswork, clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . missings report, percent
    
    Checking missings in all variables:
    15082 observations with missing values
    
    age           24     0.08%
    msp           16     0.06%
    nev_mar       16     0.06%
    grade          2     0.01%
    not_smsa       8     0.03%
    c_city         8     0.03%
    south          8     0.03%
    ind_code     341     1.20%
    occ_code     121     0.42%
    union       9296    32.58%
    wks_ue      5704    19.99%
    tenure       433     1.52%
    hours         67     0.23%
    wks_work     703     2.46%
    
    . missings report, percent format(%2.1f)
    
    Checking missings in all variables:
    15082 observations with missing values
    
    age           24     0.1%
    msp           16     0.1%
    nev_mar       16     0.1%
    grade          2     0.0%
    not_smsa       8     0.0%
    c_city         8     0.0%
    south          8     0.0%
    ind_code     341     1.2%
    occ_code     121     0.4%
    union       9296    32.6%
    wks_ue      5704    20.0%
    tenure       433     1.5%
    hours         67     0.2%
    wks_work     703     2.5%
    
    . missings report, obs percent format(%2.1f) min(5)
    
    Checking missings in all variables:
    15082 observations with missing values
    
           +-----------------------+
           | # missing   % missing |
           |-----------------------|
     6924. |         5        23.8 |
    19002. |         6        28.6 |
    21220. |         6        28.6 |
    22493. |         6        28.6 |
    22628. |         5        23.8 |
           +-----------------------+

    Comment


    • #3
      Love it Nick!!!! Thanks!
      Alfonso Sanchez-Penalver

      Comment


      • #4
        Now written up as http://www.stata-journal.com/article...article=dm0085

        Comment


        • #5
          This has been of great help to me. Thanks a lot Nick Cox...

          Comment


          • #6
            For different reasons

            Code:
            search missings 
            and

            Code:
            search cox, author 
            aren't especially helpful.

            This is a flag that dm0085_1 finds the latest update of missings as I write. Typing that gives you a clickable link to install.

            Code:
            . search dm0085_1, entry
            
            Search of official help files, FAQs, Examples, SJs, and STBs
            
            SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
                    (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                    Q3/17   SJ 17(3):779
                    identify() and sort options have been added

            Comment


            • #7
              missings was updated further in SJ 20(4). In an updated Stata you can get a clickable link by

              Code:
              . search dm0085_2, entry
              
              Search of official help files, FAQs, Examples, and Stata Journals
              
              SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
                      (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                      Q4/20   SJ 20(4):1028--1030
                      sorting has been extended for missings report
              
              (end of search)
              Last edited by Nick Cox; 25 Jan 2021, 03:19.

              Comment


              • #8
                while type missing command my Stata 14 didn't recognize missing command. After installation I am writing,
                missings report

                Comment


                • #9
                  missings must be installed before it can be used, regardless of the version of Stata you are using. Typing



                  Code:
                  net sj 20-4 dm0085_2
                  
                  net install dm0085_2
                  should work for you. The version on SSC is not up-to-date. .

                  Comment


                  • #10
                    Thanks It worked.

                    Comment

                    Working...
                    X