  • Outlier detection in stratified survey data

    Dear All,

    I have a stratified survey sample of 1000 observations, with 8-10 observations per stratum. I am certain that there are outliers in the data, and I am also certain that the majority of the outliers do not belong to the population.

    I would like to estimate population and subpopulation parameters with survey weights, such as the mean and standard deviation of the variables, and I would also like to carry out a regression analysis with the data. I can clearly see that the outliers have a high impact on the estimates. As I am certain that the majority of the outliers are not part of the population, I want their influence removed from the estimates.

    I understand that there are outlier detection techniques that can deal with some of these issues; however, as I understand it, those methods largely rely on the random sample assumption. This sample, however, is a stratified sample, designed to cover all strata of the population under analysis. Within a single stratum, sampling was close to random, but a sample size of 8-10 is too small for the conventional outlier detection methods.

    I am also aware of robust regression techniques that can be used to decrease the influence of outliers; however, as far as I know, it is not well established how these should be used in a weighted regression context.

    Can you perhaps suggest a systematic way to detect the outliers in the sample and remove their influence?

    Thank you.

  • #2
    Hello John,

    Welcome to the Stata Forum / Statalist.

    I assume your model is, basically, a linear regression. Since you didn't comment much about the probability weights (are they going to be calculated before or after the removal of the outliers?), I decided to divide my suggestions:

    For a general approach (under linear regression), before applying the survey design, you may wish to read this text, written by Richard Williams.

    For a general approach (under survey), you may draw a scatter plot of the pweight variable versus the yvar and check whether any observations are extreme on both at once. Additionally, you may calculate the mean for each subpopulation with and without the "suspicious" outliers, and decide on that basis.
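
    As a rough illustration of the with/without comparison, here is a minimal sketch in Python rather than Stata. The function names, the tuple layout and the toy figures are all hypothetical, and the pweights are assumed to stay fixed when the flagged observations are dropped; that may not hold if the weights are recalculated after removal.

```python
from collections import defaultdict


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def compare_means(rows):
    """Weighted mean per subpopulation, with and without flagged rows.

    rows: iterable of (group, y, pweight, suspicious) tuples (hypothetical layout).
    Returns {group: (mean_all, mean_without_flagged)}.
    Assumption: pweights are NOT recalculated after dropping flagged rows.
    """
    by_group = defaultdict(list)
    for group, y, w, suspicious in rows:
        by_group[group].append((y, w, suspicious))

    result = {}
    for group, obs in by_group.items():
        mean_all = weighted_mean([y for y, _, _ in obs], [w for _, w, _ in obs])
        kept = [(y, w) for y, w, flagged in obs if not flagged]
        # If every observation in the group is flagged, there is nothing left to average.
        mean_kept = (
            weighted_mean([y for y, _ in kept], [w for _, w in kept]) if kept else None
        )
        result[group] = (mean_all, mean_kept)
    return result


# Toy data, invented for illustration only.
example_rows = [
    ("A", 10.0, 2.0, False),
    ("A", 12.0, 2.0, False),
    ("A", 100.0, 1.0, True),   # the "suspicious" observation
    ("B", 5.0, 1.0, False),
    ("B", 7.0, 3.0, False),
]

for group, (mean_all, mean_kept) in sorted(compare_means(example_rows).items()):
    print(group, mean_all, mean_kept)
```

    A large gap between the two means in a subpopulation shows how much the flagged observations drive the estimate. It does not by itself show that they are not part of the population.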

    Finally, assuming you know the recommendations above very well and yet they seem "not appropriate enough" for the distribution of your data, and also assuming you are absolutely "certain" which observations are outliers, I gather your certainty rests on a given "belief", say, that variable X should not have a value beyond Z in this particular group. If so, the exclusion could be justified by that rationale itself.

    To end, there is always the risk of running into pitfalls when deleting outliers. But this you surely know as well.
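
    A rule of the "X should not exceed Z in this group" kind can be written down explicitly, so that the exclusion is documented and reproducible. The sketch below is in Python rather than Stata; the field names `group` and `x`, the `limits` mapping and the threshold values are hypothetical placeholders for whatever substantive rule applies.

```python
def flag_by_rule(rows, limits):
    """Flag observations whose x exceeds the plausible maximum for their group.

    rows:   list of dicts with keys "group" and "x" (hypothetical field names).
    limits: dict mapping group -> largest plausible x (the "Z" of the rule).
    Groups without an entry in limits are never flagged.
    """
    return [
        row["group"] in limits and row["x"] > limits[row["group"]]
        for row in rows
    ]


# Toy data and thresholds, invented for illustration only.
sample = [
    {"group": "small_firms", "x": 40},
    {"group": "small_firms", "x": 900},   # beyond the assumed limit of 250
    {"group": "large_firms", "x": 900},   # within the assumed limit of 5000
    {"group": "unclassified", "x": 10**6},  # no rule defined, so not flagged
]
plausible_max = {"small_firms": 250, "large_firms": 5000}

flags = flag_by_rule(sample, plausible_max)
print(flags)
```

    Keeping the flag as a separate variable, rather than deleting rows, lets the estimates be rerun with and without the excluded observations.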

    Hopefully that helped.
    Best regards,

    Marcos
