  • Outlier

    Hi everybody
    I have the model below. I found that I have one outlier that has influence on the coefficients. I do not want to drop the outlier. Is there any other way that I can modify the model without dropping the outlier?
    . reg infmort lpcinc lphysic lpopul, hc3

    Linear regression Number of obs = 51
    F( 3, 47) = 0.56
    Prob > F = 0.6413
    R-squared = 0.1391
    Root MSE = 2.0581

    ------------------------------------------------------------------------------
    | Robust HC3
    infmort | Coef. Std. Err. t P>|t| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    lpcinc | -4.684585 4.942597 -0.95 0.348 -14.62781 5.258638
    lphysic | 4.153227 7.051872 0.59 0.559 -10.03331 18.33976
    lpopul | -.0878245 .768478 -0.11 0.909 -1.633803 1.458154
    _cons | 33.85875 22.82631 1.48 0.145 -12.06186 79.77937


  • #2
    you can always downweight it; if you want an automatic procedure:
    Code:
    search bound
    scroll down to an actual program of that name (SJ), and follow the instructions to download and install; then see
    Code:
    help bound
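For comparison, official Stata also ships `rreg`, which downweights poorly fitting observations automatically. This is a different estimator from the `bound` program mentioned above and is shown only as a sketch of what automatic downweighting looks like; the weight variable name `w` is made up for illustration.

```stata
* Sketch only: assumes the thread's variables are already in memory.
* rreg is official Stata's robust regression (Huber, then biweight iterations).
rreg infmort lpcinc lphysic lpopul, genwt(w)

* Inspect the smallest final weights. rreg first screens out gross outliers
* (Cook's distance > 1), so DC may show a missing weight rather than a small one.
sort w
list dc infmort w in 1/5
list dc infmort w if dc == 1
```

If DC ends up with a missing or near-zero weight, the fit is effectively the same as dropping it, which may not be what is wanted here.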



    • #3
      As you can see, I used robust standard errors, but the coefficients are still not close to the ones I get when I drop the outlier.



      • #4
        "Robust" (Huber-White-Eicker-sandwich-whatever) standard errors have nothing to do with providing a fit robust or resistant to outliers.
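A quick way to see this, assuming the thread's variables are in memory: fit the model with and without HC3 standard errors and compare the coefficient vectors. The matrix names `b_ols` and `b_hc3` are made up for illustration.

```stata
* Same model, conventional standard errors
regress infmort lpcinc lphysic lpopul
matrix b_ols = e(b)

* Same model, HC3 standard errors
regress infmort lpcinc lphysic lpopul, vce(hc3)
matrix b_hc3 = e(b)

* mreldif() returns the largest relative difference between two matrices;
* it should be zero (or negligibly small) because only the standard errors change
display mreldif(b_ols, b_hc3)
```

The point estimates come from the same least-squares fit either way, so an influential observation pulls them just as hard.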

        You don't show whether the problem is in the response, one or more predictors or both. Posting some graphs of the data might help us advise.
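One possible way to check where DC is unusual, sketched with official post-estimation tools after a plain `regress` (the new variable names `lev`, `rstu`, and `cook` are hypothetical):

```stata
* Plain OLS fit; these diagnostics are not offered after vce(robust) or vce(hc3)
regress infmort lpcinc lphysic lpopul

predict double lev, leverage    // leverage: unusual in the predictors
predict double rstu, rstudent   // studentized residual: unusual in the response
predict double cook, cooksd     // Cook's distance: overall influence
dfbeta                          // one DFBETA variable per predictor

list dc infmort lev rstu cook if dc == 1

* Graphs; each one replaces the previous one in the Graph window
graph matrix infmort lpcinc lphysic lpopul
lvr2plot, mlabel(dc)
avplots
```

High leverage points to the predictors, a large studentized residual points to the response, and a large Cook's distance says the observation is moving the fit.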



        • #5

          When we run the regression including DC, the results suggest higher infant mortality with more physicians per capita, and infant mortality does not appear to be related to population size. Higher per capita income is estimated to lower infant mortality.
          Some characteristic makes DC different from the states. When we exclude DC from the regression, the results change: more physicians are associated with lower infant mortality, the estimated effect of income shrinks, and infant mortality rates are higher in more populous states. I am looking for a way to modify my model without dropping DC.

          infmort is the infant mortality rate in state (per 1,000 live births)
          lpcinc is the natural log of per capita income
          lphysic is the natural log of the number of medical doctors per 100,000 civilian population
          lpopul is the natural log of the population (in thousands)


          . reg infmort lpcinc lphysic lpopul if dc==0, hc3

          Linear regression Number of obs = 50
          F( 3, 46) = 4.16
          Prob > F = 0.0109
          R-squared = 0.2732
          Root MSE = 1.2464

          ------------------------------------------------------------------------------
          | Robust HC3
          infmort | Coef. Std. Err. t P>|t| [95% Conf. Interval]
          -------------+----------------------------------------------------------------
          lpcinc | -.5669247 1.925574 -0.29 0.770 -4.442905 3.309056
          lphysic | -2.74184 1.36443 -2.01 0.050 -5.488296 .0046156
          lpopul | .6292351 .227944 2.76 0.008 .1704076 1.088063
          _cons | 23.95478 13.88081 1.73 0.091 -3.985841 51.89541
          ------------------------------------------------------------------------------


          . reg infmort lpcinc lphysic lpopul, hc3

          Linear regression Number of obs = 51
          F( 3, 47) = 0.56
          Prob > F = 0.6413
          R-squared = 0.1391
          Root MSE = 2.0581

          ------------------------------------------------------------------------------
          | Robust HC3
          infmort | Coef. Std. Err. t P>|t| [95% Conf. Interval]
          -------------+----------------------------------------------------------------
          lpcinc | -4.684585 4.942597 -0.95 0.348 -14.62781 5.258638
          lphysic | 4.153227 7.051872 0.59 0.559 -10.03331 18.33976
          lpopul | -.0878245 .768478 -0.11 0.909 -1.633803 1.458154
          _cons | 33.85875 22.82631 1.48 0.145 -12.06186 79.77937
          ------------------------------------------------------------------------------

          I also attached the screen shot of part of the dataset.



          • #6
            So, 50 US states plus the District of Columbia. Unfortunately, a screenshot of some of the data is not much help, and there are still no graphs and no answer to my question: is the problem in the response, the predictors, or both? Still, it's hardly a surprise that DC is unusual.

            I suggest that you do read and act on https://www.statalist.org/forums/help#stata -- which explains about screenshots -- and show us the result of

            Code:
            dataex  infmort lpcinc lphysic lpopul dc



            • #7
              Sorry, but your link is useless to anyone but people at your University (not me, not almost all of us). I made a very precise request in #6. I can't advise further unless you comply.



              • #8
                I do not understand your request in #6. Would you please explain a little?



                • #9
                  I gave you a link which you should read, please.



                  • #10
                    I am getting this :


                    . dataex infmort lpcinc lphysic lpopul dc
                    unrecognized command: dataex
                    r(199);



                    • #11
                      Again, explained in the link.

                      12.2 What to say about your data

                      As from Stata 15.1 (and 14.2 from 19 December 2017), dataex is included with the official Stata distribution. Users of Stata 15 (or 14) must update to benefit from this.

                      Users of earlier versions of Stata must install dataex from SSC before they can use it. Type ssc install dataex in your Stata.
                      It seems also that you are not using the latest version of Stata, which you are asked to explain.

                      Every time you post to Statalist you are prompted to read the FAQ Advice first. Not doing this just slows down progress in a thread and obliges other people to explain what you're expected to find out for yourself.
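In practice, the FAQ advice quoted above could be followed with something like this sketch, which installs `dataex` from SSC only if Stata cannot already find it (an internet connection is assumed):

```stata
* which exits with a nonzero return code if the command is not installed
capture which dataex
if _rc ssc install dataex

* Produce a copy-and-paste data example for the forum
dataex infmort lpcinc lphysic lpopul dc
```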



                      • #12
                        Does this work?

                        input float(infmort lpcinc lphysic lpopul) byte dc
                        7.1 9.954703 5.298317 7.011214 0
                        9 9.593082 5.062595 6.683361 0
                        8.8 9.841346 5.351858 8.099858 0
                        9.4 9.761175 5.278115 8.540323 0
                        9.6 9.729967 5.056246 8.620472 0
                        10.7 9.919705 5.356586 9.344084 0
                        8.8 9.690851 5.283204 8.206584 0
                        9.6 9.837615 5.459586 9.38278 0
                        8.2 9.767382 5.241747 8.495357 0
                        9.2 9.626019 4.990433 8.053887 0
                        9.6 10.00188 5.752573 9.797571 0
                        9.6 9.83124 5.337538 9.467924 0
                        10.6 9.704 5.247024 8.79921 0
                        10.1 9.938324 5.293305 6.50129 0
                        10.3 9.672123 5.278115 8.492286 0
                        8.6 9.709114 4.934474 6.118097 0
                        8 9.626284 5.135798 6.459905 0
                        8.4 9.785154 5.164786 7.815207 0
                        11.7 9.618668 5.081404 8.156797 0
                        9.2 9.549452 5.010635 7.762596 0
                        8.3 9.757073 5.147494 7.363914 0
                        7.9 10.14753 5.720312 8.097731 0
                        10.8 9.60905 5.062595 8.304248 0
                        8.5 9.614738 5.123964 8.212026 0
                        7.3 9.835744 5.393628 8.383662 0
                        8.1 9.731987 5.01728 7.929126 0
                        9 10.12571 5.505332 8.952864 0
                        7.5 9.54938 5.220356 7.451822 0
                        8.4 9.887358 5.068904 7.091742 0
                        7.8 9.838309 5.361292 8.490233 0
                        10.2 9.887307 5.361292 8.730206 0
                        11.1 9.568015 5.236442 8.34759 0
                        8.3 9.743201 5.32301 7.952263 0
                        10.7 9.814492 5.220356 9.137232 0
                        12.1 9.449357 4.890349 7.852828 0
                        9.5 9.994926 5.811141 8.472405 0
                        7.9 9.93047 5.497168 10.30092 0
                        12.4 9.743378 5.164786 8.776167 0
                        9.9 9.526755 5.111988 7.491645 0
                        9.8 9.765489 5.278115 9.291644 0
                        10.1 9.652844 4.941642 6.54535 0
                        8.1 9.717158 5.164786 9.740204 0
                        8.7 9.624897 4.828314 6.914731 0
                        6.7 9.926276 5.463832 7.010312 0
                        8.1 9.840069 5.537334 6.910751 0
                        7 10.02384 5.820083 8.702178 0
                        6.2 9.748295 5.181784 7.113142 0
                        6.4 9.777357 5.53339 6.33328 0
                        10.5 9.945924 4.983607 6.309918 0
                        9 9.555631 5.209486 7.323171 0
                        20.7 10.08101 6.421622 6.408529 1
                        end



                        • #13
                          Yes, that is readable; thanks. In terms of what to do with DC as an outlier, I think I would work with infant mortality on a log scale and use an indicator predictor for DC. I am not especially fond of that as an ad hoc technique, but it beats leaving out DC as too awkward to handle.

                          Code:
                          . gen linfmort = log(infmort)
                          
                          . regress linfmort lpcinc lphysic lpopul dc
                          
                                Source |       SS           df       MS      Number of obs   =        51
                          -------------+----------------------------------   F(4, 46)        =     13.67
                                 Model |   1.0389911         4  .259747774   Prob > F        =    0.0000
                              Residual |  .874199802        46  .019004344   R-squared       =    0.5431
                          -------------+----------------------------------   Adj R-squared   =    0.5033
                                 Total |   1.9131909        50  .038263818   Root MSE        =    .13786
                          
                          ------------------------------------------------------------------------------
                              linfmort |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                lpcinc |  -.0440503   .1815224    -0.24   0.809    -.4094359    .3213353
                               lphysic |  -.3214726    .131702    -2.44   0.019     -.586575   -.0563702
                                lpopul |   .0726595   .0211368     3.44   0.001     .0301134    .1152057
                                    dc |   1.344865   .1956737     6.87   0.000      .950994    1.738735
                                 _cons |   3.728076   1.373624     2.71   0.009     .9631141    6.493037
                          ------------------------------------------------------------------------------
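A note on what the indicator does, as a sketch under the same data: an indicator that is 1 for a single observation fits that observation exactly, so the other coefficients should match those from a regression that simply excludes DC. The indicator keeps DC visible in the model and quantifies how far it sits from the pattern of the 50 states. The matrix names `b_dummy` and `b_drop` are made up for illustration.

```stata
* Log response, as in the output above (capture in case it already exists)
capture generate linfmort = log(infmort)

* Fit with the DC indicator
regress linfmort lpcinc lphysic lpopul dc
matrix b_dummy = e(b)

* Back-transform the DC coefficient: exp(1.344865) is roughly 3.84, so DC's
* rate is about 3.8 times what the other predictors alone would imply
display exp(_b[dc])

* Fit on the 50 states only
regress linfmort lpcinc lphysic lpopul if dc == 0
matrix b_drop = e(b)

* Compare: slopes and constant should agree; b_dummy has one extra column for dc
matrix list b_dummy
matrix list b_drop
```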
                          The graphs come from multqplot (Stata Journal) and favplots (SSC). Important detail: DC is an outlier on physicians too, and working with logs does not tame the outlier for either variable (nor would I really expect that here). Different standard errors won't help here. Quantile regression is another possibility, however.
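As a minimal sketch of the quantile regression option, using official `qreg` with its default of the median and assuming the same variables are in memory:

```stata
* Median regression: minimizes the sum of absolute residuals, so a single
* extreme response value has less pull on the fit than under least squares
qreg infmort lpcinc lphysic lpopul

* The same model for the log response, if linfmort has been generated
capture noisily qreg linfmort lpcinc lphysic lpopul
```

Median regression guards against outliers in the response more than against unusual predictor values, so with DC also extreme on physicians the results would still need checking.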

                          [Figure: multq.png]
                          [Figure: favplots.png]


                          That's just some thoughts, and the thread remains wide open for other points of view.

