
  • accessing the winsorized sample

    Hi there,
    I have winsorized several variables in my dataset to exclude the top 5% of responses. I would then like to create a variable that identifies the winsorized sample across all these variables, i.e., to see whether the same people are reporting really high values on several variables. How can I do this?

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int record float(nonspec_sum psychiatrist_sum psychologist_sum socialworker_sum other_sum nonspec_sumw psychiatrist_sumw psychologist_sumw socialworker_sumw other_sumw)
    1217  3  .  .  .  1  3  .  .  .  1
    1227  .  2  4  2  1  .  2  4  2  1
    1230  . 15  8  6  .  . 15  8  6  .
    1234  .  3  3  4  .  .  3  3  4  .
    1236  2  .  .  1  1  2  .  .  1  1
    1238  .  .  .  2  .  .  .  .  2  .
    1239  .  3  2  .  .  .  3  2  .  .
    1241  2  .  2  2  .  2  .  2  2  .
    1248  .  .  .  .  .  .  .  .  .  .
    1251  4  3  2  1  .  4  3  2  1  .
    1259  .  2  2  2  2  .  2  2  2  2
    1260  .  4  4  4  .  .  4  4  4  .
    1264  .  3  .  . 15  .  3  .  . 15
    1267  1  .  .  .  1  1  .  .  .  1
    1271  .  5  1  .  .  .  5  1  .  .
    1275  2  2  2  2  2  .  .  .  .  .
    1276  .  4  4  4  .  .  4  4  4  .
    1279  .  .  3  .  .  .  .  3  .  .
    1280  .  .  .  .  .  .  .  .  .  .
    1283  9  6  8  8  .  9  6  8  8  .
    1284  2  .  1  .  .  2  .  1  .  .
    1285  3  2  .  8  7  3  2  .  8  7
    1288  8  7  .  7  .  8  7  .  7  .
    1294  .  .  4  4  .  .  .  4  4  .
    1296  .  .  .  .  .  .  .  .  .  .
    1297  .  .  1  .  1  .  .  1  .  1
    1310  .  .  4  4  .  .  .  4  4  .
    1314  .  2  .  .  .  .  2  .  .  .
    1319  .  5  2  .  .  .  5  2  .  .
    1321  .  7  .  3  .  .  7  .  3  .
    1327  .  .  .  .  .  .  .  .  .  .
    1331  5  4  .  .  .  5  4  .  .  .
    1333  .  .  .  .  .  .  .  .  .  .
    1338  .  .  .  .  2  .  .  .  .  2
    1341  .  1  1  .  .  .  1  1  .  .
    1342  1  .  .  1  1  1  .  .  1  1
    1343  .  .  3  .  .  .  .  3  .  .
    1346  2  .  .  .  .  2  .  .  .  .
    1348  .  .  .  2  .  .  .  .  2  .
    1349 11  7  3  .  . 11  7  3  .  .
    1351  .  2  .  .  .  .  2  .  .  .
    1353  2  2  2  .  .  2  2  2  .  .
    1356  3  3  3  4  .  3  3  3  4  .
    1357  4  2  .  3  3  4  2  .  3  3
    1359  4  .  4  2  4  4  .  4  2  4
    1360  .  .  .  .  .  .  .  .  .  .
    1361  3  .  5  3  .  3  .  5  3  .
    1363  .  .  .  .  .  .  .  .  .  .
    1364  2  2  2  2  2  2  2  2  2  2
    1365  .  8 14  .  .  .  8 14  .  .
    1370  3  .  2  .  .  3  .  2  .  .
    1378  9  6  9 13  .  9  6  9 13  .
    1381  .  .  .  .  .  .  .  .  .  .
    1382  2  2  2  2  2  2  2  2  2  2
    1389  2  2  2  2  2  2  2  2  2  2
    1390  .  4  5  .  .  .  4  5  .  .
    1393  .  3  4  .  .  .  3  4  .  .
    1395  .  6  .  .  .  .  6  .  .  .
    1400  7  .  .  .  .  7  .  .  .  .
    1409  .  3  3  .  .  .  3  3  .  .
    1411  .  .  .  2  3  .  .  .  2  3
    1414  .  .  . 12  .  .  .  . 12  .
    1422  .  3  3  .  .  .  3  3  .  .
    1423  .  .  .  .  .  .  .  .  .  .
    1428  .  6  1  .  .  .  6  1  .  .
    1429  .  .  .  .  .  .  .  .  .  .
    1439  .  .  .  .  .  .  .  .  .  .
    1442  .  .  .  7  4  .  .  .  7  4
    1448  4  4  4  4  4  4  4  4  4  4
    1449  1  .  .  1  .  1  .  .  1  .
    1452  .  .  3  .  .  .  .  3  .  .
    1457  .  .  .  1  .  .  .  .  1  .
    1458  1  2  .  .  .  1  2  .  .  .
    1461  4  .  4  .  4  .  .  .  .  .
    1462  .  1  .  .  .  .  1  .  .  .
    1467  3  5  .  .  5  3  5  .  .  5
    1468  2  .  .  2  2  2  .  .  2  2
    1472  4  .  4  .  4  4  .  4  .  4
    1483  .  4  4  4  .  .  4  4  4  .
    1494  .  2  2  .  .  .  2  2  .  .
    1499  .  .  .  .  .  .  .  .  .  .
    1500  2  2  2  .  .  2  2  2  .  .
    1507  .  .  .  .  .  .  .  .  .  .
    1509  .  .  .  .  .  .  .  .  .  .
    1511  .  2  2  .  .  .  2  2  .  .
    1514  .  5  .  2  .  .  5  .  2  .
    1516  .  .  .  .  .  .  .  .  .  .
    1519  .  2  2  .  .  .  2  2  .  .
    1522  .  2  2  .  .  .  2  2  .  .
    1526  .  .  .  .  .  .  .  .  .  .
    1531  3  4  3  .  .  3  4  3  .  .
    1532  .  2  2  .  .  .  2  2  .  .
    1536  .  2  .  1  .  .  2  .  1  .
    1543  5  4  5  4  5  5  4  5  4  5
    1546  6  5  7  .  .  6  5  7  .  .
    1547  .  3  4  .  .  .  3  4  .  .
    1553  2  2  2  2  2  .  .  .  .  .
    1554  .  3  .  .  .  .  3  .  .  .
    1557  .  .  .  .  .  .  .  .  .  .
    1558  6  6  6  6  6  .  .  .  .  .
    end

    Thank you in advance for your help with this.

  • #2
    I wrote winsor (from SSC), which may make what follows puzzling, but there you go. Someone must have asked for Winsorizing code way back, and I wrote some. Independently, someone else wrote winsor2, which does some things differently and adds some extras.

    Winsorizing is one way to get a version of the mean or the variance that is a resistant summary, as compared with not Winsorizing. Winsorizing as a strategy to identify and decide what to do with outliers seems to me indirect, arbitrary and awkward, especially for the multivariate case. A data point that looks like an outlier on any single variable may make perfect sense in multivariate context. Contemplation of just about any scatter plot underlines the difficulty.

    I am not quite clear what you want: perhaps an indicator variable for being Winsorized for each variable, and then a look at those several indicators together. My command doesn't support creating such an indicator, but the code inside would show you how to do it.
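
    As a rough sketch of that indicator idea (not the code inside winsor or winsor2): with the example data from #1 in memory, flag values above each variable's own 95th percentile, then count the flags per person. The new names *_high and n_high are made up for illustration, and the sketch assumes that "Winsorized" here means values above the 95th percentile were pulled down to it.

    ```stata
    * Assumes the dataex example from #1 is already in memory.
    * The *_high and n_high names are hypothetical, chosen for this sketch.
    foreach v in nonspec psychiatrist psychologist socialworker other {
        * 95th percentile of the original (un-Winsorized) variable
        quietly summarize `v'_sum, detail
        * 1 if above the cutoff, 0 if not, missing if the original is missing
        generate byte `v'_high = `v'_sum > r(p95) if !missing(`v'_sum)
    }

    * Number of variables on which each person is in the upper tail
    * (rowtotal() treats missing indicators as 0)
    egen byte n_high = rowtotal(nonspec_high psychiatrist_high psychologist_high socialworker_high other_high)
    tabulate n_high
    ```

    The if !missing() qualifier matters because Stata treats missing as larger than any number, so without it every missing value would be flagged as high. Whether ties at the cutoff should count (> versus >=) is a judgment call.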

    I have often challenged people asking about Winsorizing here on Statalist to point to a good textbook or review paper explaining why Winsorizing is a really sound idea and preferable to alternatives. No one ever has. This is admittedly a loaded challenge, as I need to agree on "good".

    Your example dataset looks intensely problematic anyway because of the frequency of missing values.

    That may not seem helpful but it is intended to be honest.
