
  • Extremes

    Hi everyone

    I have a big dataset with 280 variables (I put all of them in a global macro). I need to find out whether there are any strange extreme values, so that I can drop them.

    I have read about this and found Nick Cox's extremes command, but it gives me strange output. I don't know whether I am using it incorrectly or whether it does not work when it encounters non-numeric variables.

    Please find attached a picture of the output.

    Thanks,

    Francesc

  • #2
    Please show output using CODE delimiters, not as attachments (FAQ #12).

    extremes is from SSC, as you are asked to explain (FAQ #12 again). It shows extremes on the first variable you name and then values in the same observations on all of the other variables you name. Evidently you supplied a global macro which includes lots of variables, so you got what you asked for.

    extremes is not, and cannot be, white magic that finds extremes in your data on any terms but those defined in the help. It certainly does not purport to think about your data in any multivariate space (even bivariate).

    What would "extreme" non-numeric variables (meaning string, presumably) look like?
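
    A minimal sketch of checking one variable at a time, assuming the global macro is named myvars (a hypothetical name; the thread does not give it) and that extremes has been installed from SSC. It keeps only the numeric variables and calls extremes on each separately, so that every listing refers to the lowest and highest values of that variable alone:

    ```stata
    * Assumption: $myvars is the global macro holding the variable names.
    * ssc install extremes      // run once if extremes is not yet installed

    * Keep only the numeric variables; strings are skipped.
    ds $myvars, has(type numeric)
    local numvars `r(varlist)'

    * Call extremes with a single variable name each time.
    foreach v of local numvars {
        display _newline as text "Variable: `v'"
        extremes `v'
    }
    ```

    This only lists candidate values variable by variable. Whether any of them is an error remains a judgment to make from the data.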



    • #3
      Dear Nick, thanks for your prompt response, and my apologies for not being aware of FAQ #12 (I attached a picture because I wasn't able to copy and paste the output well enough).
      I had misunderstood extremes. I thought it would display the 5 minimum and 5 maximum values of the variables in my global varlist, though I did understand that it would not work on strings.
      What should I use, then? Thank you very much for your advice. Francesc



      • #4
        With hundreds of variables it is hard to automate finding outliers and much depends on what you expect. I can't offer good ideas on what will work for you. Sometimes outliers are evident on scatter plots of important principal components. I would focus on variables with high skewness and ignore categorical variables in the first instance.
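
        One way to start on the skewness suggestion, sketched under assumptions: the global macro is called myvars (a hypothetical name), and the cutoff of 2 in absolute skewness is an arbitrary placeholder, not a recommendation made in this thread. The loop uses only official commands (ds, summarize) and lists the numeric variables whose skewness exceeds the cutoff:

        ```stata
        * Assumption: $myvars is the global macro holding the variable names.
        ds $myvars, has(type numeric)
        local numvars `r(varlist)'

        foreach v of local numvars {
            quietly summarize `v', detail
            * Skewness is missing for constant or empty variables, and
            * missing counts as larger than any number, so guard against it.
            if !missing(r(skewness)) & abs(r(skewness)) > 2 {
                display as text %-32s "`v'" as result %9.3f r(skewness)
            }
        }
        ```

        Categorical variables stored as numeric codes will still pass the type filter, so they would need to be left out of the macro by hand before running this.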



        • #5
          Thank you, Nick. In the absence of a better solution, I will take a look at those two hundred plots and at the skewness.



          • #6
            You might reduce the amount of work by looking only at the variables that you are planning to include in your regression.

