Efficient way to sort data and locate strange or missing values

Andrew Hua

Join Date: Aug 2014

Posts: 12
#1

24 Nov 2014, 14:53

I was wondering how best to sort data to locate strange or missing values. The data file has over 4 million observations, and I cannot "tab" results for many of the variables because they take on too many values. One approach I've been using is "gsort variable_name, mfirst", then browsing the sorted data manually to see if there are any odd values. Is there a more efficient way to approach this?

Thanks!

Andrew
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

24 Nov 2014, 15:05

Have you looked at -codebook-?
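
For readers unfamiliar with -codebook-, here is a minimal sketch of how it might be used for this kind of screening. The data set, the variable names (id, income), and the -999 sentinel value are made up for illustration and are not from the original question.

```stata
// Hypothetical demo data: 1,000 observations with some missing values
// and one implausible sentinel value.
clear
set seed 12345
set obs 1000
gen long id = _n
gen double income = 50000 * runiform()
replace income = . if mod(id, 100) == 0   // 10 missing values
replace income = -999 in 7                // a suspicious value

// One line per variable: observations, unique values, mean, min, max.
codebook, compact

// Report potential problems, such as constant or all-missing variables.
codebook, problems

// Fuller report for a single variable: range, unique values, missing count.
codebook income
```

The compact form is usually the quickest first pass on a large file, since an out-of-range minimum or maximum shows up immediately without tabulating every value.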
Nick Cox

Join Date: Mar 2014

Posts: 36058
#3

24 Nov 2014, 16:52

What would you regard as odd?
ben earnhart

Join Date: May 2014

Posts: 1027
#4

24 Nov 2014, 18:19

-misstable- may help in identifying patterns of missing data.

"Strange," on the other hand, I don't know what to make of. You might try something like "tab varname if _n>`i' & _n<`j'", but since I don't know what you're looking for and what you intend to do about these "strange" values, it's hard to give concrete advice.
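
As a rough illustration of the -misstable- suggestion, the sketch below builds a small made-up data set (the variables x1 and x2 and the missingness pattern are assumptions for the demo) and shows two of its subcommands.

```stata
// Hypothetical demo data with a nested pattern of missing values.
clear
set seed 2014
set obs 1000
gen long id = _n
gen double x1 = runiform()
gen double x2 = runiform()
replace x1 = . if mod(id, 50) == 0    // 20 missing values
replace x2 = . if mod(id, 100) == 0   // 10 missing values, all also missing in x1

// Number of missing and nonmissing observations for each variable.
misstable summarize x1 x2

// Which variables are missing together, reported as frequencies.
misstable patterns x1 x2, freq

// A quick count for a single variable, without sorting or browsing.
count if missing(x1)
```

Neither subcommand requires the data to be sorted, so this avoids the sort-and-browse step entirely when missing values are the only concern.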

Last edited by ben earnhart; 24 Nov 2014, 18:24.

Mike Lacy

Join Date: Apr 2014
Posts: 2449
#5

25 Nov 2014, 10:12

If strange means "very small or very big," -summarize x, detail- will list the four smallest and four largest values for a variable, but it does not save these values or identify the observations they belong to. To overcome that, you could use -_pctile- as follows:

Code:

// Make data with 400,000 observations and 7 variables for a demo.
clear
set obs 400000
gen long id = _n
forval i = 1/7 {
  gen x`i' = runiform()
}
//
// List the id and values of the 5 largest and smallest values for each variable.  The _pctile command
// is one way to find the approximate cutpoints to define these values.
local howmany = 5
local lowptile = 100 * (`howmany'/_N) 
local highptile = 100 - `lowptile'
foreach v of varlist  x* {
   _pctile `v', percentiles(`lowptile' `highptile') // stores cutpoints in r(r1) and r(r2)
   di "Variable `v', observations with `howmany' smallest and largest values."
   list id `v' if !missing(`v') & !inrange(`v', r(r1), r(r2))
   di "____________________________________________________" _newline
}
//
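
Because the percentile cutpoints are approximate, the loop above can list slightly fewer or more than the requested number of observations when values tie or fall exactly on a cutpoint. If an exact count matters and one sort per variable is affordable, a variant along the following lines may work. This is only a sketch on made-up data (the variable x and the 10 injected missing values are assumptions), not part of the original suggestion.

```stata
// Hypothetical demo data: 1,000 observations, 10 of them missing.
clear
set seed 1
set obs 1000
gen long id = _n
gen double x = runiform()
replace x = . in 1/10

local k = 5
sort x                          // missing values sort to the end
count if !missing(x)
local n = r(N)                  // position of the largest nonmissing value

list id x in 1/`k'              // exactly k smallest values
list id x in `=`n'-`k'+1'/`n'   // exactly k largest nonmissing values
```

The trade-off is speed: sorting millions of observations once per variable is slower than computing two percentiles, which is presumably why the percentile approach was suggested.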

Regards, Mike


Nick Cox

Join Date: Mar 2014

Posts: 36058
#6

25 Nov 2014, 11:22

See also (e.g.) extremes from SSC.
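
A minimal sketch of what using -extremes- might look like, shown on Stata's bundled auto data. The syntax here (a main variable followed by extra variables to display, and an n() option for the number of values at each end) is written from memory and should be treated as an assumption; check -help extremes- after installing.

```stata
// One-time installation from SSC (requires an internet connection).
ssc install extremes

// Example data shipped with Stata.
sysuse auto, clear

// Show the 5 lowest and 5 highest values of price, with make alongside.
extremes price make, n(5)
```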
