  • Dealing with outliers/extreme values by group

    Hello,

    First of all, because of a bug I ended up with several posts and replies on similar questions in this forum. Sorry for the mess and confusion.

    Now, in Stata, I am trying to run robustness checks for outliers (or extreme values) by group, and I am running into problems implementing them.

    First, I would like to trim/winsorize by group using this code:

    Code:
    foreach var in `varlist' {
    bys group: replace `var' = `r(p1)' if `var' < `r(p1)'
    bys group: replace `var' = `r(p99)' if `var' > `r(p99)'
    }
    However, Stata reports this error:
    Code:
    "if not found"
    When I try only

    Code:
    foreach var in `varlist' {
    replace `var' = `r(p1)' if `var' < `r(p1)'
    replace `var' = `r(p99)' if `var' > `r(p99)'
    }
    this, it works perfectly fine, but I want to do it by group. Could someone help me with this?

    Second, I am running multiple regressions on panel data (with multiple countries, sectors, and years), and I would like to ensure that, within each file, the regressions for different LHS variables are based on a common sample, so that differences in results across LHS variables are not driven by differences in sample. For instance, if my LHS variables are var1, var2, and var3, then all three should be treated as missing whenever any one of var1, var2, or var3 is missing. This way the sample does not depend on the fixed effects and should be consistent. It would also let me check that I am not losing too many observations. I would like to implement this, but I do not know how to write it in Stata code. Could someone suggest a structure or some code that I can use, please?

    Thanks so much in advance!

  • #2
    There are two problems with your code. First, you refer to the returned results r(p1) and r(p99), but these are never computed in your code because there is no call to summarize with the detail option. Second, when replacing values larger than r(p99), you will also replace missing values with the p99 value, because Stata treats missing as larger than any number. This is a potentially fatal error. You can solve this task with a double loop like this:

    Code:
    sysuse nlsw88, clear
    levelsof industry, local(groups)
    foreach G of local groups {
        foreach VAR of varlist wage hours {
            sum `VAR' if industry == `G', det
            replace `VAR' = `r(p1)' if industry == `G' & `VAR' < `r(p1)'
            replace `VAR' = `r(p99)' if industry == `G' & `VAR' > `r(p99)' & !missing(`VAR')
        }
    }
    Here, industry is the group variable and wage and hours are being winsorized.
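
    On the second question in #1 (a common sample across several LHS variables), here is a minimal sketch rather than a definitive recipe. It assumes the nlsw88 data again, with wage, hours, and tenure standing in for your LHS variables and age, grade, and industry dummies as purely illustrative regressors. The indicator name common is made up for this example. Instead of overwriting values with missing, it flags the observations where every LHS variable is observed and restricts each regression to them:

    Code:
    sysuse nlsw88, clear
    * 1 if all three LHS variables are nonmissing, 0 otherwise
    gen byte common = !missing(wage, hours, tenure)
    * how many observations the restriction costs
    count if !common
    foreach Y of varlist wage hours tenure {
        regress `Y' age grade i.industry if common
    }
    Because the right-hand side is identical across the three regressions, any observation dropped for a missing regressor is dropped from all of them, so the estimation samples coincide.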
    Best wishes

    Stata 18.0 MP | ORCID | Google Scholar


    • #3
      Cross-posted (and answered) at https://stackoverflow.com/questions/...alues-by-group

      We ask that you tell us about cross-posting.

      Felix Bittmann makes an excellent point about missings.
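
      To see why this matters, here is a small illustration with made-up data (the variable x and its three values are invented for this example). Stata sorts numeric missing values above every nonmissing number, so a condition like x > 2 is true for a missing x:

      Code:
      clear
      set obs 3
      gen x = _n
      replace x = . in 3
      * counts 1: the only observation satisfying x > 2 is the missing one
      count if x > 2
      * counts 0 once missing values are excluded
      count if x > 2 & !missing(x)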


      • #4
        Thank you Felix and Nick for your reply, it is very helpful.
        and thanks for your advice on cross-posting.

        Felix Bittmann, I have an additional clarification question:
        if I use
        Code:
        foreach var of varlist {
        bys group: replace `var' = r(p1) if `var' < r(p1)
        bys group: replace `var' = r(p99) if `var' < r(p99) & !missing(`var')
        then is this equivalent to your code from #2?


        • #5
          Absolutely not equivalent, for the reason Felix explained in this thread and I explained on Stack Overflow. You have fixed one problem only, missing values.


          • #6
            On the assumption that both Nick Cox and Felix Bittmann are asleep at this time of day in Europe, I'll jump in here.

            No, the code you show in #4 is definitely not equivalent to the code in #2. The code in #2 is correct for your purpose; that in #4 is not.

            The difference is that in #2, there are two nested loops. Consequently, the -summarize- command is executed separately for each group and for each variable VAR, and the values of the current VAR are then compared with the current `r(p1)' and `r(p99)'.

            By contrast, in #4, you have only one loop, with some -by group- commands, which are like loops, but are not exactly the same thing, nested within it. It contains no -summarize- command at all. (Consequently `r(p1)' and `r(p99)' are undefined and will cause syntax error messages.) Even if you add a -summarize- command on the line after -foreach var of varlist-, for each var it will be executed only once for the entire data set, not by group. Consequently the values of `r(p1)' and `r(p99)' referred to in the -replace `var' = ...- commands will not be group specific.
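
            If you would rather avoid the explicit loop over groups, one possible sketch (an alternative, not the code from #2) uses -egen- with the pctile() function under -bysort- to store group-specific percentiles in variables. The names lo and hi are temporary variables invented for this example, and I am assuming that pctile() uses the same percentile definition as -summarize, detail-, which you should check on your own data. Note also that -bysort- treats observations with missing industry as a group of their own, whereas -levelsof- in #2 skips them.

            Code:
            sysuse nlsw88, clear
            foreach VAR of varlist wage hours {
                tempvar lo hi
                * group-specific 1st and 99th percentiles, one value per observation
                bysort industry: egen `lo' = pctile(`VAR'), p(1)
                bysort industry: egen `hi' = pctile(`VAR'), p(99)
                replace `VAR' = `lo' if `VAR' < `lo'
                * exclude missings, which would otherwise count as > p99
                replace `VAR' = `hi' if `VAR' > `hi' & !missing(`VAR')
                drop `lo' `hi'
            }
            Here the percentiles live in variables rather than in r(), so each observation is compared with the percentile of its own group.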

            Added: Crossed with #5, and apparently Nick is still up!


            • #7
              Nick Cox, Clyde Schechter, thank you both for your kind and detailed replies.
              They are indeed super helpful and I now get the point. I appreciate it a lot.
