  • How to deal with outliers?

    Hi,

    I am running a fixed-effects panel regression, and some of my control variables have outliers.
    For example:

    Code:
     univar land_amount
                                            -------------- Quantiles --------------
    Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
    -------------------------------------------------------------------------------
    land_amount    5734     1.87     4.17     0.00     0.02     0.55     2.50   100.00
    -------------------------------------------------------------------------------
    Had 100 been a one-off value, it could have been a recording error. However, a tabulation of the variable shows several such uncharacteristically high values (given the context of the data), which makes me believe that these are not recording errors.
    Code:
    tab land_amount
    
    land_amount |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        504        8.79        8.79
            .01 |        382        6.66       15.45
            .02 |        694       12.10       27.55
            .03 |        484        8.44       36.00
            .04 |        150        2.62       38.61
            .05 |        167        2.91       41.52
            .06 |         36        0.63       42.15
            .06 |          2        0.03       42.19
            .07 |         14        0.24       42.43
            .08 |         10        0.17       42.61
            .09 |          6        0.10       42.71
             .1 |          1        0.02       42.73
             .1 |         23        0.40       43.13
            .11 |          5        0.09       43.22
            .12 |          4        0.07       43.29
            .12 |          2        0.03       43.32
            .13 |          1        0.02       43.34
            .14 |          2        0.03       43.37
            .15 |         11        0.19       43.56
            .16 |          1        0.02       43.58
            .18 |          1        0.02       43.60
            .18 |          1        0.02       43.62
            .19 |          1        0.02       43.63
             .2 |         23        0.40       44.04
            .21 |          2        0.03       44.07
            .22 |          3        0.05       44.12
            .23 |          8        0.14       44.26
            .24 |          1        0.02       44.28
            .24 |          1        0.02       44.30
            .25 |          3        0.05       44.35
            .26 |          1        0.02       44.37
            .27 |          9        0.16       44.52
            .28 |          8        0.14       44.66
            .29 |          1        0.02       44.68
             .3 |         12        0.21       44.89
            .31 |         10        0.17       45.06
            .32 |          2        0.03       45.10
            .32 |         18        0.31       45.41
            .33 |          6        0.10       45.52
            .34 |          2        0.03       45.55
            .35 |          4        0.07       45.62
            .35 |          3        0.05       45.67
            .36 |          1        0.02       45.69
            .36 |          1        0.02       45.71
            .37 |          4        0.07       45.78
            .38 |          3        0.05       45.83
            .39 |          1        0.02       45.85
             .4 |         10        0.17       46.02
            .41 |          4        0.07       46.09
            .42 |          1        0.02       46.11
            .42 |         14        0.24       46.36
            .43 |          7        0.12       46.48
            .44 |          2        0.03       46.51
            .45 |          1        0.02       46.53
            .46 |          1        0.02       46.55
            .47 |          1        0.02       46.56
            .49 |          1        0.02       46.58
             .5 |         15        0.26       46.84
            .51 |         20        0.35       47.19
            .52 |         67        1.17       48.36
            .53 |         50        0.87       49.23
            .54 |         28        0.49       49.72
            .55 |         19        0.33       50.05
            .56 |          3        0.05       50.10
            .57 |          4        0.07       50.17
            .59 |          1        0.02       50.19
             .6 |          5        0.09       50.28
            .62 |          7        0.12       50.40
            .63 |          5        0.09       50.49
            .64 |          2        0.03       50.52
            .67 |          1        0.02       50.54
            .68 |          1        0.02       50.56
             .7 |          6        0.10       50.66
            .71 |          4        0.07       50.73
            .71 |          1        0.02       50.75
            .72 |          7        0.12       50.87
            .73 |         10        0.17       51.05
            .73 |          1        0.02       51.06
            .74 |          3        0.05       51.12
            .75 |          3        0.05       51.17
            .76 |          4        0.07       51.24
            .77 |          4        0.07       51.31
            .78 |          6        0.10       51.41
            .79 |          1        0.02       51.43
             .8 |         11        0.19       51.62
            .81 |          4        0.07       51.69
            .82 |          6        0.10       51.80
            .83 |          9        0.16       51.95
            .84 |          1        0.02       51.97
            .85 |          2        0.03       52.01
            .86 |          3        0.05       52.06
            .88 |          2        0.03       52.09
            .89 |          1        0.02       52.11
             .9 |          1        0.02       52.13
             .9 |          1        0.02       52.15
            .92 |          1        0.02       52.16
       .9299999 |          1        0.02       52.18
            .95 |          2        0.03       52.21
            .96 |          1        0.02       52.23
            .97 |          1        0.02       52.25
              1 |          9        0.16       52.41
           1.01 |         53        0.92       53.33
           1.02 |        146        2.55       55.88
           1.03 |        122        2.13       58.00
           1.04 |         52        0.91       58.91
           1.05 |         56        0.98       59.89
           1.06 |         17        0.30       60.18
           1.07 |          6        0.10       60.29
           1.08 |          2        0.03       60.32
           1.08 |          2        0.03       60.36
           1.09 |          3        0.05       60.41
            1.1 |          1        0.02       60.43
            1.1 |         10        0.17       60.60
           1.11 |          1        0.02       60.62
           1.12 |          4        0.07       60.69
           1.13 |          1        0.02       60.71
           1.14 |          1        0.02       60.73
           1.18 |          1        0.02       60.74
           1.19 |          1        0.02       60.76
            1.2 |          5        0.09       60.85
           1.22 |          1        0.02       60.87
           1.25 |          1        0.02       60.88
           1.26 |          2        0.03       60.92
           1.27 |          3        0.05       60.97
           1.28 |          5        0.09       61.06
           1.28 |          1        0.02       61.07
            1.3 |          5        0.09       61.16
           1.31 |          1        0.02       61.18
           1.32 |          2        0.03       61.21
           1.33 |          1        0.02       61.23
           1.35 |          2        0.03       61.27
           1.36 |          1        0.02       61.28
           1.39 |          1        0.02       61.30
            1.4 |          3        0.05       61.35
            1.4 |          1        0.02       61.37
           1.41 |          1        0.02       61.39
           1.42 |          2        0.03       61.42
           1.43 |          1        0.02       61.44
            1.5 |         11        0.19       61.63
           1.51 |         27        0.47       62.10
           1.52 |         57        0.99       63.10
           1.53 |         38        0.66       63.76
           1.54 |         17        0.30       64.06
           1.55 |         11        0.19       64.25
           1.56 |          3        0.05       64.30
           1.58 |          3        0.05       64.35
            1.6 |          4        0.07       64.42
           1.61 |          2        0.03       64.46
           1.62 |          3        0.05       64.51
           1.63 |          1        0.02       64.53
           1.72 |          3        0.05       64.58
           1.74 |          1        0.02       64.60
           1.75 |          1        0.02       64.61
           1.76 |          2        0.03       64.65
           1.77 |          3        0.05       64.70
           1.79 |          2        0.03       64.74
            1.8 |          3        0.05       64.79
           1.82 |          2        0.03       64.82
           1.83 |          1        0.02       64.84
           1.83 |          1        0.02       64.86
            1.9 |          1        0.02       64.88
           1.92 |          1        0.02       64.89
           1.94 |          1        0.02       64.91
              2 |         21        0.37       65.28
           2.01 |         46        0.80       66.08
           2.02 |        143        2.49       68.57
           2.03 |        147        2.56       71.14
           2.04 |         54        0.94       72.08
           2.05 |         65        1.13       73.21
           2.06 |         17        0.30       73.51
           2.07 |          7        0.12       73.63
           2.08 |          8        0.14       73.77
           2.09 |          1        0.02       73.79
            2.1 |         20        0.35       74.14
           2.11 |          1        0.02       74.15
           2.12 |          8        0.14       74.29
           2.13 |          3        0.05       74.35
           2.15 |          1        0.02       74.36
           2.15 |          3        0.05       74.42
           2.16 |          1        0.02       74.43
           2.19 |          2        0.03       74.47
            2.2 |          5        0.09       74.56
           2.23 |          2        0.03       74.59
           2.25 |          3        0.05       74.64
           2.26 |          1        0.02       74.66
           2.27 |          3        0.05       74.71
           2.28 |          3        0.05       74.76
            2.3 |          3        0.05       74.82
           2.33 |          1        0.02       74.83
            2.4 |          3        0.05       74.89
           2.41 |          1        0.02       74.90
           2.43 |          1        0.02       74.92
           2.49 |          2        0.03       74.96
            2.5 |         15        0.26       75.22
           2.51 |          7        0.12       75.34
           2.52 |         22        0.38       75.72
           2.53 |         34        0.59       76.32
           2.54 |         16        0.28       76.60
           2.55 |         11        0.19       76.79
           2.56 |          1        0.02       76.81
           2.57 |          1        0.02       76.82
           2.58 |          2        0.03       76.86
            2.6 |          2        0.03       76.89
           2.63 |          2        0.03       76.93
           2.65 |          1        0.02       76.94
           2.66 |          1        0.02       76.96
           2.66 |          1        0.02       76.98
           2.73 |          1        0.02       77.00
           2.75 |          2        0.03       77.03
           2.76 |          1        0.02       77.05
           2.77 |          1        0.02       77.07
           2.78 |          1        0.02       77.08
            2.8 |          2        0.03       77.12
           2.81 |          1        0.02       77.14
           2.82 |          2        0.03       77.17
           2.99 |          1        0.02       77.19
              3 |         10        0.17       77.36
           3.01 |         47        0.82       78.18
           3.02 |         88        1.53       79.72
           3.03 |        101        1.76       81.48
           3.04 |         35        0.61       82.09
           3.05 |         41        0.72       82.80
           3.06 |          6        0.10       82.91
           3.07 |          5        0.09       83.00
           3.08 |          2        0.03       83.03
           3.09 |          1        0.02       83.05
            3.1 |         11        0.19       83.24
           3.12 |          6        0.10       83.34
           3.15 |          3        0.05       83.40
           3.16 |          2        0.03       83.43
           3.22 |          1        0.02       83.45
           3.23 |          2        0.03       83.48
           3.24 |          1        0.02       83.50
           3.25 |          4        0.07       83.57
           3.27 |          1        0.02       83.59
           3.28 |          1        0.02       83.61
            3.3 |          5        0.09       83.69
           3.31 |          1        0.02       83.71
           3.34 |          1        0.02       83.73
           3.36 |          1        0.02       83.75
            3.4 |          1        0.02       83.76
            3.4 |          2        0.03       83.80
           3.46 |          1        0.02       83.82
           3.46 |          1        0.02       83.83
           3.48 |          1        0.02       83.85
            3.5 |          7        0.12       83.97
           3.51 |          3        0.05       84.03
           3.52 |          9        0.16       84.18
           3.53 |          7        0.12       84.30
           3.54 |          3        0.05       84.36
           3.55 |          6        0.10       84.46
           3.56 |          1        0.02       84.48
           3.57 |          2        0.03       84.51
           3.58 |          1        0.02       84.53
            3.6 |          3        0.05       84.58
            3.6 |          1        0.02       84.60
           3.62 |          2        0.03       84.64
           3.63 |          1        0.02       84.65
           3.64 |          1        0.02       84.67
           3.65 |          1        0.02       84.69
           3.67 |          1        0.02       84.71
            3.7 |          2        0.03       84.74
           3.74 |          1        0.02       84.76
           3.75 |          3        0.05       84.81
           3.77 |          1        0.02       84.83
           3.78 |          1        0.02       84.84
            3.8 |          1        0.02       84.86
           3.81 |          1        0.02       84.88
           3.88 |          1        0.02       84.90
              4 |          4        0.07       84.97
           4.01 |         13        0.23       85.19
           4.02 |         57        0.99       86.19
           4.03 |          1        0.02       86.21
           4.03 |         61        1.06       87.27
           4.04 |         25        0.44       87.70
           4.05 |         19        0.33       88.04
           4.06 |         10        0.17       88.21
           4.08 |          4        0.07       88.28
           4.08 |          1        0.02       88.30
            4.1 |          6        0.10       88.40
            4.2 |          3        0.05       88.45
           4.22 |          1        0.02       88.47
           4.25 |          2        0.03       88.51
           4.28 |          1        0.02       88.52
            4.4 |          1        0.02       88.54
           4.43 |          1        0.02       88.56
           4.47 |          1        0.02       88.58
            4.5 |          3        0.05       88.63
           4.51 |          1        0.02       88.65
           4.52 |          9        0.16       88.80
           4.53 |          3        0.05       88.86
           4.54 |          5        0.09       88.94
           4.55 |          2        0.03       88.98
           4.66 |          2        0.03       89.01
           4.72 |          1        0.02       89.03
           4.78 |          1        0.02       89.05
           4.79 |          1        0.02       89.07
            4.8 |          1        0.02       89.08
            4.9 |          1        0.02       89.10
              5 |          6        0.10       89.20
           5.01 |         16        0.28       89.48
           5.02 |         74        1.29       90.77
           5.03 |         40        0.70       91.47
           5.04 |         34        0.59       92.06
           5.05 |         36        0.63       92.69
           5.06 |          8        0.14       92.83
           5.07 |          3        0.05       92.88
           5.08 |          8        0.14       93.02
            5.1 |          7        0.12       93.15
           5.12 |          3        0.05       93.20
           5.15 |          1        0.02       93.22
           5.18 |          1        0.02       93.23
            5.2 |          1        0.02       93.25
           5.25 |          3        0.05       93.30
           5.26 |          1        0.02       93.32
           5.28 |          1        0.02       93.34
            5.3 |          1        0.02       93.36
           5.34 |          1        0.02       93.37
           5.35 |          1        0.02       93.39
            5.4 |          2        0.03       93.43
            5.5 |          3        0.05       93.48
           5.51 |          1        0.02       93.49
           5.52 |          4        0.07       93.56
           5.53 |          3        0.05       93.62
           5.54 |          2        0.03       93.65
           5.55 |          2        0.03       93.69
           5.56 |          1        0.02       93.70
            5.6 |          1        0.02       93.72
           5.84 |          1        0.02       93.74
              6 |          1        0.02       93.76
           6.01 |          5        0.09       93.84
           6.02 |         12        0.21       94.05
           6.03 |         15        0.26       94.31
           6.04 |          5        0.09       94.40
           6.05 |          7        0.12       94.52
           6.06 |          5        0.09       94.61
           6.07 |          4        0.07       94.68
           6.08 |          2        0.03       94.72
            6.1 |          1        0.02       94.73
           6.12 |          2        0.03       94.77
           6.15 |          1        0.02       94.79
            6.2 |          1        0.02       94.80
           6.25 |          1        0.02       94.82
           6.35 |          1        0.02       94.84
           6.37 |          1        0.02       94.86
            6.5 |          1        0.02       94.87
           6.52 |          3        0.05       94.93
           6.53 |          1        0.02       94.94
           6.54 |          2        0.03       94.98
           6.55 |          1        0.02       94.99
            6.6 |          1        0.02       95.01
            6.7 |          1        0.02       95.03
           6.75 |          1        0.02       95.05
            6.8 |          1        0.02       95.06
           6.83 |          1        0.02       95.08
           7.01 |          4        0.07       95.15
           7.02 |         15        0.26       95.41
           7.03 |          8        0.14       95.55
           7.04 |         10        0.17       95.73
           7.05 |         10        0.17       95.90
           7.06 |          1        0.02       95.92
           7.07 |          3        0.05       95.97
            7.2 |          1        0.02       95.99
            7.5 |          1        0.02       96.01
           7.53 |          2        0.03       96.04
           7.54 |          1        0.02       96.06
           7.55 |          2        0.03       96.09
           7.75 |          1        0.02       96.11
           7.79 |          1        0.02       96.13
            7.8 |          1        0.02       96.15
           7.81 |          1        0.02       96.16
           8.02 |         10        0.17       96.34
           8.03 |          8        0.14       96.48
           8.04 |          8        0.14       96.62
           8.05 |          6        0.10       96.72
           8.06 |          6        0.10       96.83
           8.07 |          1        0.02       96.84
            8.1 |          3        0.05       96.90
           8.15 |          1        0.02       96.91
            8.5 |          1        0.02       96.93
           8.52 |          2        0.03       96.97
           8.53 |          1        0.02       96.98
           8.54 |          3        0.05       97.04
           8.64 |          1        0.02       97.05
           9.01 |          1        0.02       97.07
           9.02 |          1        0.02       97.09
           9.03 |          4        0.07       97.16
       9.030001 |          1        0.02       97.17
           9.04 |          3        0.05       97.23
           9.05 |          1        0.02       97.24
           9.06 |          1        0.02       97.26
           9.08 |          1        0.02       97.28
            9.1 |          2        0.03       97.31
           9.18 |          1        0.02       97.33
           9.28 |          1        0.02       97.35
            9.5 |          1        0.02       97.37
             10 |          3        0.05       97.42
          10.02 |          7        0.12       97.54
          10.03 |          8        0.14       97.68
          10.04 |          6        0.10       97.79
          10.05 |         10        0.17       97.96
          10.06 |          1        0.02       97.98
          10.06 |          3        0.05       98.03
          10.08 |          2        0.03       98.06
           10.1 |          2        0.03       98.10
          10.12 |          1        0.02       98.12
          10.15 |          2        0.03       98.15
             11 |          1        0.02       98.17
          11.03 |          1        0.02       98.19
          11.04 |          1        0.02       98.20
          11.07 |          1        0.02       98.22
           11.2 |          1        0.02       98.24
             12 |          1        0.02       98.26
          12.03 |          2        0.03       98.29
          12.04 |          2        0.03       98.33
          12.05 |          4        0.07       98.40
          12.06 |          1        0.02       98.41
          12.07 |          1        0.02       98.43
           12.3 |          2        0.03       98.47
          12.53 |          1        0.02       98.48
          12.89 |          1        0.02       98.50
             13 |          1        0.02       98.52
          13.02 |          1        0.02       98.54
          13.03 |          1        0.02       98.55
          13.04 |          1        0.02       98.57
          13.05 |          2        0.03       98.60
          13.06 |          1        0.02       98.62
          13.12 |          1        0.02       98.64
          13.53 |          1        0.02       98.66
          13.58 |          1        0.02       98.67
          14.02 |          2        0.03       98.71
          14.03 |          1        0.02       98.73
          14.07 |          1        0.02       98.74
          14.54 |          1        0.02       98.76
           14.8 |          1        0.02       98.78
          14.91 |          1        0.02       98.80
             15 |          1        0.02       98.81
          15.03 |          2        0.03       98.85
          15.04 |          4        0.07       98.92
          15.05 |          3        0.05       98.97
          15.06 |          2        0.03       99.01
          15.07 |          1        0.02       99.02
          15.08 |          1        0.02       99.04
           15.5 |          1        0.02       99.06
          15.53 |          1        0.02       99.08
          15.55 |          1        0.02       99.09
          16.03 |          1        0.02       99.11
          16.04 |          1        0.02       99.13
          16.05 |          1        0.02       99.15
          17.02 |          1        0.02       99.16
          17.05 |          1        0.02       99.18
          17.06 |          1        0.02       99.20
          18.02 |          1        0.02       99.22
          18.03 |          1        0.02       99.23
           18.1 |          1        0.02       99.25
          18.15 |          1        0.02       99.27
           18.5 |          1        0.02       99.28
          19.03 |          1        0.02       99.30
          20.01 |          2        0.03       99.34
          20.02 |          1        0.02       99.35
          20.04 |          4        0.07       99.42
          20.05 |          1        0.02       99.44
          20.06 |          1        0.02       99.46
          20.08 |          1        0.02       99.48
           20.1 |          1        0.02       99.49
           20.8 |          1        0.02       99.51
           22.2 |          1        0.02       99.53
           23.1 |          1        0.02       99.55
          24.12 |          1        0.02       99.56
          25.03 |          2        0.03       99.60
          25.04 |          1        0.02       99.62
          25.06 |          1        0.02       99.63
          27.05 |          1        0.02       99.65
          28.14 |          1        0.02       99.67
          28.15 |          1        0.02       99.69
           28.2 |          1        0.02       99.70
          29.03 |          1        0.02       99.72
             30 |          1        0.02       99.74
          30.04 |          1        0.02       99.76
          30.05 |          1        0.02       99.77
             33 |          1        0.02       99.79
          35.05 |          1        0.02       99.81
           35.1 |          1        0.02       99.83
          38.26 |          1        0.02       99.84
           43.5 |          1        0.02       99.86
             45 |          1        0.02       99.88
          50.03 |          1        0.02       99.90
          50.04 |          1        0.02       99.91
             75 |          1        0.02       99.93
          75.02 |          1        0.02       99.95
            100 |          3        0.05      100.00
    ------------+-----------------------------------
          Total |      5,734      100.00
    I understand that I can adopt several methods to deal with these outliers, such as winsorizing, trimming, or transformation. I think that, in my context, trimming would take away the natural variation that exists in my data and would not be appropriate. Would winsorizing be appropriate here, given the distribution of the data? Also, are there estimation methods that can deal with outliers without any of these operations being applied beforehand?

    P.S. I'm in economics, in case that matters for answering the question.

    Would appreciate your help in this regard.
    Thanks!

  • #2
    univar is community-contributed, as you are asked to explain (FAQ Advice #12). Indeed, it is hard to find unless you already know where to look.

    STB-51 sg67.1 . . . . . . . . . . . . . . . . . . . . . . . Update to univar
    (help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
    9/99 pp.27--28; STB Reprints Vol 9, pp.159--161
    improvements and new options to univar

    STB-36 sg67 . . . . . . . . . . . . . . . Univariate summaries with boxplots
    (help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
    3/97 pp.23--25; STB Reprints Vol 6, pp.179--183
    command that offers a streamlined display of univariate summaries,
    including, optionally, text-mode boxplots



    Elsewhere I mentioned various reactions to (possible) outliers, here edited slightly:

    What to do with outliers is an open and very difficult question. Loosely, different solutions and strategies have varying appeal. Here is a partial list of possibilities. The ordering is arbitrary and not meant to convey any order in terms of applicability, importance or any other criterion. Nor are these approaches mutually exclusive.

    * One (in my view good) definition is that "[o]utliers are sample values that cause surprise in relation to the majority of the sample" (W.N. Venables and B.D. Ripley. 2002. Modern Applied Statistics with S, New York: Springer, p.119). However, surprise is in the mind of the beholder and is dependent on some tacit or explicit model of the data. There may be another model under which the outlier is not surprising at all, so the data really are (say) lognormal or gamma rather than normal. In short, be prepared to (re)consider your model.

    * Go into the laboratory or the field and do the measurement again. Often this is not practicable, but it would seem standard in several sciences.

    * Test whether outliers are genuine. Most of the tests look pretty contrived to me, but you might find one that you can believe fits your situation. Irrational faith that a test is appropriate is always needed to apply a test that is then presented as quintessentially rational.

    * Throw them out as a matter of judgement.

    * Throw them out using some more-or-less automated (usually not "objective") rule.

    * Ignore them, partially or completely. This could be formal (e.g. trimming) or just a matter of leaving them in the dataset, but omitting them from analyses as too hot to handle.

    * Pull them in using some kind of adjustment, e.g. Winsorizing.

    * Downplay them by using some other robust estimation method.

    * Downplay them by working on a transformed scale.

    * Downplay them by using a non-identity link function.

    * Accommodate them by fitting some appropriate fat-, long-, or heavy-tailed distribution, without or with predictors.

    * Accommodate by using an indicator or dummy variable as an extra predictor in a model.

    * Side-step the issue by using some non-parametric (e.g. rank-based) procedure.

    * Get a handle on the implied uncertainty using bootstrapping, jackknifing or a permutation-based procedure.

    * Edit to replace an outlier with some more likely value, based on deterministic logic. "An 18-year-old grandmother is unlikely, but the person in question was born in 1940, so presumably is really 81."

    * Edit to replace an impossible or implausible outlier using some imputation method that is currently acceptable not-quite-white magic.

    * Analyse with and without, and see how much difference the outlier(s) make(s), statistically, scientifically or practically.

    * Something Bayesian. My prior ignorance of quite what forbids me from giving any details.

    This isn't intended as a complete list -- partial is what it says -- and indeed I hope it prompts people to add something I've forgotten (likely) or didn't know about (very likely).
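
    As one concrete instance of the "analyse with and without" item, here is a minimal sketch, not a recommendation. The thread does not name the outcome or the panel identifiers, so y, id and year are hypothetical, the toy panel is simulated only so that the example runs on its own, and the cutoff of 50 is arbitrary.

    ```stata
    * Toy panel, simulated purely for illustration (not the poster's data).
    * y, id and year are hypothetical names; land_amount is from post #1.
    clear
    set seed 2803
    set obs 200
    gen id = _n
    expand 5
    bysort id: gen year = 2000 + _n
    gen land_amount = min(100, exp(rnormal(0, 1.5)))
    gen y = 1 + 0.2 * land_amount + rnormal()
    xtset id year

    * Fit with all observations
    xtreg y land_amount, fe vce(cluster id)
    estimates store all

    * Refit leaving out the highest values (the cutoff of 50 is arbitrary)
    xtreg y land_amount if land_amount <= 50, fe vce(cluster id)
    estimates store without

    * Compare coefficients and standard errors side by side
    estimates table all without, b(%9.4f) se(%9.4f)
    ```

    With real data, only the lines from xtset onwards would be needed. If the two columns barely differ, the high values are probably not driving the fit.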

    Being from economics does not to me make any difference, although any inclination to do something that fits in with current fashions and foibles in your field is something others may be able to advise on. I see here a percent (percent of land that is of a certain kind?) and in a large dataset it doesn't seem surprising that values range from 0 to 100.

    There is intriguing fine structure here, perhaps best shown by looking at the fractional part of each percent:

    Code:
    gen fraction = mod(land_amount, 1)
    
    spikeplot fraction, root yla(0 10 "100" 20 "400" 30 "900", ang(h)) xla(0(0.1)1) xmtick(0(0.01)1) ytitle(Frequency (square root scale)) xtitle(Fraction) scheme(s1color)
    
    quantile fraction, yla(0(0.1)1, ang(h)) ymtick(0(0.01)1) scheme(s1color) ms(oh) msize(vsmall)

    [Figure: titir1.png]

    [Figure: titir2.png]

    The diagonal reference line produced by quantile (showing what you would get with a uniform (rectangular, flat) distribution over the data range) is usually a distraction, but here it is pertinent.

    Why are many answers just above an integer %? Perhaps this is an artefact of division of numerator and denominator at some point. Perhaps it shows up a worrying data quality problem. In quite different datasets I've seen dishonest reporters fuzz up round figures in an attempt to convey that they were produced by exact measurement. I have no information on this variable otherwise.

    That said, the big picture here is of a very skewed distribution ranging from 0 to 100. I would not worry much about that. The highest values are so rare that there is little danger of their disturbing model fits, although you might want to check that quickly. Trimming and Winsorizing are not good ideas. Transformations that would pull in the tails a little and symmetrically include folded root (sqrt(percent) - sqrt(100 - percent)) and angular (arcsine square root). Indeed, as your concern is with very high values, square root to pull in high values is even simpler. Here I use transplot from SSC and qplot from the Stata Journal to show the effects of some possible transformations. (I called the variable land_pc in my work to shorten axis titles.)


    [Figure: Graph.png]


    See https://www.statalist.org/forums/for...dable-from-ssc on transplot.
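
    The three transformations mentioned above can be generated directly. This is only a sketch: the data are simulated, land_pc is assumed to be a percent between 0 and 100, and the new variable names (root, froot, angular) are hypothetical. It uses summarize rather than transplot or qplot, whose syntax is not reproduced here.

```stata
* Sketch only: simulated right-skewed percent, assumed to lie in [0, 100]
clear
set seed 2021
set obs 1000
gen land_pc = 100 * rbeta(0.3, 8)

* Square root pulls in the upper tail only
gen root = sqrt(land_pc)

* Folded root treats both ends of the 0-100 range symmetrically
gen froot = sqrt(land_pc) - sqrt(100 - land_pc)

* Angular transformation: arcsine of the square root of the proportion
gen angular = asin(sqrt(land_pc / 100))

* Compare skewness and tail behaviour on each scale
summarize land_pc root froot angular, detail
```

    The folded root and angular scales only make sense for a bounded percent or proportion; for an unbounded amount the plain square root is the one that carries over.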



    • #3
      There are some community-contributed commands for detecting outliers. For example, -screen- (Marco Santacroce), -grubbs- (Nicolas Couderc), -gboxplot- (Christopher Bruffaerts, Vincenzo Verardi, and Catherine Vermandele), -sdasym- (Christopher Bruffaerts, Vincenzo Verardi, and Catherine Vermandele), and -winsor- (Nicholas J. Cox).



      • #4
        winsor from SSC does not purport to detect outliers. Just possibly, if you have outliers you could use it as advertised, but I can't let that statement stand.



        • #5
          Yes, the winsor command is designed for Winsorizing a variable.



          • #6
            Originally posted by Nick Cox
            univar is community-contributed, as you are asked to explain (FAQ Advice #12). Indeed, it is hard to find it unless you know already where to look.

            STB-51 sg67.1 . . . . . . . . . . . . . . . . . . . . . . . Update to univar
            (help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
            9/99 pp.27--28; STB Reprints Vol 9, pp.159--161
            improvements and new options to univar

            STB-36 sg67 . . . . . . . . . . . . . . . Univariate summaries with boxplots
            (help univar if installed) . . . . . . . . . . . . . . J. R. Gleason
            3/97 pp.23--25; STB Reprints Vol 6, pp.179--183
            command that offers a streamlined display of univariate summaries,
            including, optionally, text-mode boxplots



            Elsewhere I mentioned various reactions to (possible) outliers, here edited slightly

            What to do with outliers is an open and very difficult question. Loosely, different solutions and strategies have varying appeal. Here is a partial list of possibilities. The ordering is arbitrary and not meant to convey any order in terms of applicability, importance or any other criterion. Nor are these approaches mutually exclusive.

            * One (in my view good) definition is that "[o]utliers are sample values that cause surprise in relation to the majority of the sample" (W.N. Venables and B.D. Ripley. 2002. Modern Applied Statistics with S, New York: Springer, p.119). However, surprise is in the mind of the beholder and is dependent on some tacit or explicit model of the data. There may be another model under which the outlier is not surprising at all, so the data really are (say) lognormal or gamma rather than normal. In short, be prepared to (re)consider your model.

            * Go into the laboratory or the field and do the measurement again. Often this is not practicable, but it would seem standard in several sciences.

            * Test whether outliers are genuine. Most of the tests look pretty contrived to me, but you might find one that you can believe fits your situation. Irrational faith that a test is appropriate is always needed to apply a test that is then presented as quintessentially rational.

            * Throw them out as a matter of judgement.

            [...]

            Thank you so much Nick for explaining it in such detail. My apologies for not returning to this post sooner; I was caught up in another project and could only come back to this now.

            The unit of the land amount variable is acres. So households in my data have landholding ranging from 0.1 acre to 100 acres. I agree with your observation that this is a highly skewed distribution, which reflects the reality of landholding distribution in the country under study.

            I'll take your advice about using a square root. Since the land variable is a control variable in my study, is it acceptable to take the square root of this variable only, leaving the dependent variable, the independent variable of interest and the other control variables untransformed?

            Thanks,

            Apologies once again for returning to this post after such a long time.



            • #7
              I understand about distractions.

              But acceptable? I don't mind ad hoc transformations -- where ad hoc means ideally fit for purpose -- otherwise I wouldn't have suggested one. But I can't speak for any gatekeepers downstream of your work, whether advisors, supervisors, examiners, reviewers, editors, whoever.

              Statalist can be oracular about Stata coding but it can't serve as a definitive guide for your research choices.

              The positive advice about any transformation is just to try it to see how much difference it makes, and how model performance and diagnostics vary.
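
              One way to "try it and see" is to fit the same fixed-effects model twice, changing only the scale of the control. This is a sketch on simulated data; id, year, y, x, land and root_land are hypothetical names, and the data-generating process is an assumption chosen purely for illustration.

```stata
* Sketch only: simulated panel with hypothetical variable names
clear
set seed 4321
set obs 300
gen id = _n
expand 4
bysort id: gen year = _n

* Skewed control, predictor of interest, and outcome
gen land = exp(rnormal(0, 1.2))
gen x = rnormal()
gen y = 2 + 0.4*x + 0.3*sqrt(land) + rnormal()
xtset id year

* Only the control is transformed; y and x stay on their original scales
gen root_land = sqrt(land)

* Same model twice, differing only in the scale of the control
xtreg y x land, fe
estimates store raw
xtreg y x root_land, fe
estimates store root

* Compare the coefficient of interest and the within R-squared
estimates table raw root, keep(x) b(%9.3f) se(%9.3f) stats(N r2_w)
```

              Residual plots after each fit would be the natural next check, alongside the coefficient comparison.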



              • #8
                Thank you Nick. I think I'll just try the different transformations and see how it goes.

