  • Efficient way to sort data and locate strange or missing values

    I was wondering how best to sort data to locate strange or missing values. The data file has over 4 million observations, and I cannot "tab" many of the variables because they take on too many values. My approach so far has been to run "gsort variable_name, mfirst" and then browse the sorted data manually for odd values. Is there a more efficient way to do this?

    Thanks!

    Andrew

  • #2
    Have you looked at -codebook-?
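A minimal sketch of what -codebook- reports, run on made-up data. The variables id and x, the seed, and the planted missing and implausible values are assumptions for illustration only, not the original poster's data.

```stata
// Hypothetical demo data: 1,000 observations with a few planted problems.
clear
set seed 12345
set obs 1000
gen long id = _n
gen x = runiform()
replace x = .   in 1/10     // plant 10 missing values
replace x = 999 in 11       // plant one implausible value

// Detailed report for one variable: type, range, unique values, missing count.
codebook x

// One line per variable (Obs, Unique, Mean, Min, Max), handy when
// there are too many distinct values to -tabulate-.
codebook id x, compact
```

In the compact output, an Obs count below the number of observations points to missing values, and a Min or Max far outside the expected range points to a value worth inspecting.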



    • #3
      What would you regard as odd?



      • #4
        -misstable- may help in identifying patterns of missing data.

        "Strange," on the other hand, I don't know what to make of. You might try something like "tab varname if _n>`i' & _n<`j'" to tabulate one slice of observations at a time, but since I don't know what you're looking for or what you intend to do about these "strange" values, it is hard to give concrete advice.
        Last edited by ben earnhart; 24 Nov 2014, 18:24.
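A minimal sketch of how -misstable- might be used to look at missing-data patterns. The variables x1 and x2, the seed, and the planted missing values are assumptions made up for the demo.

```stata
// Hypothetical demo data with overlapping missingness in two variables.
clear
set seed 2014
set obs 1000
gen long id = _n
gen x1 = runiform()
gen x2 = runiform()
replace x1 = . in 1/50      // x1 missing in observations 1-50
replace x2 = . in 26/100    // x2 missing in observations 26-100

// Count of missing values per variable (only variables with missings are shown).
misstable summarize x1 x2

// Which combinations of missing/nonmissing occur, and how often.
misstable patterns x1 x2
```

Here the patterns table should separate observations missing only x1, only x2, or both, which is harder to see by browsing sorted data.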



        • #5
          If strange means "very small or very big," -summarize x, detail- will list the four smallest and four largest values of a variable, but it does not save these values or identify the observations they belong to. To overcome that, you could use -_pctile- as follows:
          Code:
          // Make data with 400,000 observations and 7 variables for a demo.
          clear
          set obs 400000
          gen long id = _n
          forval i = 1/7 {
            gen x`i' = runiform()
          }
          //
          // List the id and values of the 5 largest and smallest values for each variable.  The _pctile command
          // is one way to find the approximate cutpoints to define these values.
          local howmany = 5
          local lowptile = 100 * (`howmany'/_N) 
          local highptile = 100 - `lowptile'
          foreach v of varlist  x* {
             _pctile `v', percentiles(`lowptile' `highptile') // stores cutpoints in r(r1) and r(r2)
             di "Variable `v', observations with `howmany' smallest and largest values."
             list id `v' if !missing(`v') & !inrange(`v', r(r1), r(r2))
             di "____________________________________________________" _newline
          }
          //
          Regards, Mike



          • #6
            See also (e.g.) extremes from SSC.

