dropping missing variable based on a condition for huge dataset

Hana abukhadijah

Join Date: May 2023
Posts: 1

08 May 2023, 03:21

I have a dataset of 2,933 patients and 1,136 biomarkers.

Because the dataset is large, I would like to know how to write a loop that drops every variable with more than 5% missing values. (Variables are in the columns and patients' codes are in the rows.)

I used the command below

missings report, percent sort

but how do I drop the variables with more than 5% missing values?

I hope someone can help. Note that I use Stata 17.

Thanks

I tried Google, YouTube, and the Stata documentation, but I could not figure it out.

Andrew Musau

Join Date: Oct 2014
Posts: 10287

08 May 2023, 03:35

Code:

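* Note: this approach changes the sort order of the data.
* Numeric variables: an ascending sort puts missing values last, so if the
* observation at the 95% position is missing, more than 5% are missing.
ds, has(type numeric)
foreach var in `r(varlist)' {
    sort `var'
    if missing(`var'[`=ceil(.95*_N)']) {
        drop `var'
    }
}
* String variables: a descending sort puts empty strings ("") last.
ds, has(type string)
foreach var in `r(varlist)' {
    gsort -`var'
    if missing(`var'[`=ceil(.95*_N)']) {
        drop `var'
    }
}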
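
For comparison, a count-based variant can be sketched as follows. This is an assumption added for illustration, not something posted in this thread: it counts missing values per variable instead of sorting, so the order of observations is left alone, and missing() handles numeric and string variables in one loop. The auto dataset shipped with Stata stands in for the biomarker data; the 5% threshold comes from #1.

```stata
* Toy data as a stand-in: rep78 is missing in 5 of 74 observations (about 6.8%)
sysuse auto, clear

* Drop every variable whose share of missing values exceeds 5%
foreach var of varlist _all {
    quietly count if missing(`var')
    if r(N) > 0.05*_N {
        display as text "dropping `var' (" r(N) " missing)"
        drop `var'
    }
}
```

To apply it to real data, replace the sysuse line with a use command for your own file and adjust 0.05 as needed.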


Nick Cox

Join Date: Mar 2014

Posts: 35783
#3

08 May 2023, 03:55

Cross-posted in slightly different form and answered at https://stackoverflow.com/questions/...missing-values

Please note our policy on cross-posting, which is that you are asked to tell us about it. https://www.statalist.org/forums/help#crossposting

My solution is different from that of Andrew Musau. It would be interesting to know how solutions compare for speed, but not interesting enough personally for me to set up experiments.
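
One way such a speed comparison could be set up is with Stata's timer command. The sketch below is only an illustration under stated assumptions, not a benchmark anyone in this thread ran: it simulates numeric data of the size described in #1 (the variable names x1-x1136, the seed, and the missingness pattern are all made up), then times the sort-based loop from #2 against a count-based loop, which is just one possible alternative and not necessarily the solution posted on Stack Overflow.

```stata
* Simulated data: 2,933 observations, 1,136 numeric variables (all hypothetical)
clear
set seed 2023
set obs 2933
forvalues j = 1/1136 {
    * probability of a missing value rises from near 0 to 10% across variables
    generate double x`j' = cond(runiform() < `j'/11360, ., rnormal())
}

timer clear

* Timer 1: sort-based check, as in #2 (numeric variables only)
preserve
timer on 1
foreach var of varlist x* {
    sort `var'
    if missing(`var'[`=ceil(.95*_N)']) {
        drop `var'
    }
}
timer off 1
restore

* Timer 2: count-based check (an assumed alternative that does not sort)
preserve
timer on 2
foreach var of varlist x* {
    quietly count if missing(`var')
    if r(N) > 0.05*_N {
        drop `var'
    }
}
timer off 2
restore

* Elapsed seconds for timers 1 and 2
timer list
```

preserve and restore put the full simulated dataset back after each run, so both loops start from the same data.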

On missings (Stata Journal), which I wrote: the intent to make what you want a little difficult was deliberate, and is documented in the help:

Creating entirely empty observations (rows) and variables (columns) is a habit of many spreadsheet users, but neither is helpful in Stata datasets. The subcommands dropobs and dropvars should help users clean up. Conversely, there is no explicit support here for dropping observations or variables with some missing and some nonmissing values. Users so minded will find other subcommands of use as an intermediate step, but multiple imputation might be a better way forward.

Just dropping variables because they are awkward may not be the best solution. Whether multiple imputation is better, or working with the data as they come is better, is hard, indeed I suspect impossible, to know in advance. I want butchery of datasets to be the user's decision.
