  • Dropping variables based on their share of missing values in a huge dataset

    I have a dataset of 2,933 patients and 1,136 biomarkers.

    Because the dataset is huge, I want to know how to write a loop that drops every variable with more than 5% missing values (variables are in the columns, and patient codes are in the rows).

    I used the command below

    missings report, percent sort

    but how do I drop the variables that have more than 5% missing values?

    I hope someone can help. Note that I use Stata 17.

    Thanks

    I tried Google, YouTube, and the Stata documentation as well, but I could not figure it out.

  • #2
    Code:
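    * Numeric variables: an ascending sort puts missing values last
    ds, has(type numeric)
    foreach var in `r(varlist)'{
        sort `var'
        * If the observation at the 95% position is missing, more than 5% are missing
        if missing(`var'[`=ceil(.95*_N)']){
            drop `var'
        }
    }
    * String variables: a descending sort puts empty strings last
    ds, has(type string)
    foreach var in `r(varlist)'{
        gsort -`var'
        if missing(`var'[`=ceil(.95*_N)']){
            drop `var'
        }
    }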



    • #3
      Cross-posted in slightly different form and answered at https://stackoverflow.com/questions/...missing-values

      Please note our policy on cross-posting, which is that you are asked to tell us about it. https://www.statalist.org/forums/help#crossposting

      My solution is different from that of Andrew Musau. It would be interesting to know how solutions compare for speed, but not interesting enough personally for me to set up experiments.
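
      For comparison with the sort-based code in #2, one alternative is to count missing values directly, which avoids re-sorting the data for every variable. The following is only a minimal sketch and is not necessarily the solution posted on Stack Overflow. The toy dataset and the names id, marker1, marker2 and marker3 are hypothetical stand-ins for the real data.

      ```stata
      * Toy data standing in for the real dataset (all names are hypothetical)
      clear
      set obs 100
      generate id = _n
      * marker1: 10% missing, so it should be dropped
      generate marker1 = cond(_n <= 10, ., _n)
      * marker2: exactly 5% missing, so it should be kept
      generate marker2 = cond(_n <= 5, ., _n)
      * marker3: string variable with 6% empty values, so it should be dropped
      generate str1 marker3 = cond(_n <= 6, "", "a")

      * Drop every variable whose share of missing values exceeds 5%
      foreach v of varlist _all {
          quietly count if missing(`v')
          * Integer comparison avoids floating-point surprises at exactly 5%
          if 100 * r(N) > 5 * _N {
              drop `v'
          }
      }

      describe
      ```

      Because missing() accepts both numeric and string arguments, one loop covers both types. On real data, replace the toy block with your own dataset and, if some variables such as the patient code must never be dropped, loop over a narrower varlist than _all.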

      On missings (Stata Journal), which I wrote: the intent to make what you want a little difficult was deliberate, and is documented in the help:

      Creating entirely empty observations (rows) and variables (columns) is a habit of many spreadsheet users, but neither is helpful in Stata datasets. The subcommands dropobs and dropvars should help users clean up. Conversely, there is no explicit support here for dropping observations or variables with some missing and some nonmissing values. Users so minded will find other subcommands of use as an intermediate step, but multiple imputation might be a better way forward.


      Just dropping variables because they are awkward may not be the best solution. Whether multiple imputation is better, or working with the data as they come is better, is hard, indeed I suspect impossible, to know in advance. I want butchery of datasets to be the user's decision.
