  • How can I create decile of an income variable by a group

    I want to calculate the value of the decile for each county in my dataset. I'm struggling with that task. Right now I'm doing:

    gen size_decile = .
    levelsof COUNTY, local(COUNTY_1)
    foreach i in `COUNTY_1' {
    pctile decile_temp=INCOME if COUNTY==`i', nq(10)
    replace size_decile = decile_temp if missing(size_decile)
    drop decile_temp
    }

  • #2
    If what you mean is that for each observation in a given county, you want to assign to size_decile the number 1, 2, ... , or 10 depending on which income decile the observation is in, for that county, then you need to be using the xtile command rather than the pctile command.

    Added: crossed with Clyde's advice which, as always, is more comprehensive than mine. With that said, I am receptive to replacing income with its xtile (decile, for Francisco; quintile is what I've used) in modeling (treating the xtile as a categorical variable in my case). No matter how you transform income, it takes just one 980-million-dollar loss to ruin your modeling, and your day. Of course, I'm looking at quintiles based on 1000's of observations, and of fairly precisely recorded values for income so clumping is not so much an issue. (Note well that I wrote "precisely recorded" rather than "accurately recorded".)
    Last edited by William Lisowski; 05 Oct 2016, 11:06.



    • #3
      You don't describe what the difficulty is that you're having. But I'll make a guess. I'm guessing that what you want is to assign each observation to membership in one of 10 nearly equal sized income groups within each county. If that's what you want, the command you need is -xtile-, not -pctile-.
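
      A minimal sketch of that approach, assuming (as in the code in post #1) that COUNTY is a numeric variable and INCOME holds income. As far as I know, official -xtile- does not accept the by: prefix, so the sketch loops over counties; the local macro name counties and the temporary variable decile_temp are illustrative choices, not anything from the original post:

      ```stata
      * Sketch only: assumes COUNTY is numeric and INCOME is the income variable
      gen size_decile = .
      levelsof COUNTY, local(counties)
      foreach c of local counties {
          * xtile assigns each observation in this county a category 1, 2, ..., 10
          xtile decile_temp = INCOME if COUNTY == `c', nq(10)
          * copy the categories only into this county's observations
          replace size_decile = decile_temp if COUNTY == `c'
          drop decile_temp
      }
      ```

      Afterwards, something like -tabulate COUNTY size_decile- shows how evenly the observations actually fall into the ten groups within each county.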

      That said, be warned that there is a decent chance you will be unhappy with the results. Unless income is a truly continuous variable that doesn't "clump" in your data, you would be unusually lucky to find 10 nearly equal sized groups based on it. It is likely there will be some dividing line that must be placed either just below or just above some value that occurs a large number of times, so that the groups are inevitably quite imbalanced. If your plan is to use these deciles as predictors in a regression model, you should consider instead using the income variable itself, or perhaps a linear or cubic spline based on it to reflect non-linearity. Categorizing inherently continuous variables just throws away information. If you are just using the deciles as a way to exhibit trends in some other variable, consider using correlation coefficients instead.
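
      One way the spline suggestion could look, as a sketch only. The outcome variable y, the new variable names, and the knot locations are hypothetical and do not come from this thread; the cubic version shown is the restricted cubic spline that -mkspline- provides:

      ```stata
      * Linear spline in INCOME with knots at 20000 and 50000 (hypothetical values)
      mkspline inc_lo 20000 inc_mid 50000 inc_hi = INCOME
      regress y inc_lo inc_mid inc_hi

      * Restricted cubic spline with 4 knots at mkspline's default percentiles
      mkspline inc_rcs = INCOME, cubic nknots(4)
      regress y inc_rcs*
      ```

      Either version keeps income as a quantitative predictor while still letting its relationship with the outcome bend.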

      Also, in the future please post your code within code delimiters, as explained in FAQ #12. It makes the code easier to read. Thank you.

      Added: crossed with William's advice which, as always, is spot on.



      • #4
        I agree with William and Clyde (the complementary event has measure zero).

        I'll add that I've now read possibly a few hundred threads on Statalist with this issue at their heart.

        The issue is disciplinary, it seems.

        It seems that in many branches of economics, finance, and business studies there are researchers who like to chop variables into quantile bins and then say things like "OK, what can we say about the best 10% of firms?", and so on, and so forth.

        By contrast, in many fields -- epidemiology and biostatistics seem to be examples, and the various environmental sciences in which I lurk are others -- the reaction is What? Why chop? Why degrade quantitative predictors in this way? You are throwing away information, and arbitrarily.

        So, that's a stark summary, and the tension won't go away.

        As Hemingway and Fitzgerald said in a particularly insightful exchange,

        The rich are different from us.
        Yes, they have more money.

        I'm with those who want "money" to be a predictor, not "is rich". Or even "money bin".



        • #5
          Nick was probably typing while I was updating my initial answer to include my reflections on Clyde's response.

          I'm with bins that recognize that the precision with which subjects report their income has little to do with the accuracy of the values so precisely reported, or for that matter, with the actual construct of "money available to spend". And which do not require normalization to "constant dollars" for e.g. panel data, where that normalization is itself a matter of considerable disagreement. And finally, Stamp's Law of Statistics remains an important consideration.

          I'm not quite sure what the equivalent concept is in epidemiology and biostatistics - I've never been given my systolic and diastolic blood pressures adjusted to those of a 20-year-old.



          • #6
            I actually agree with William that binned responses can sometimes have greater validity than a crude guess at a continuous variable. Certainly when I am asked my income on a survey and they offer check boxes associated with ranges of reasonable width, I have no trouble accurately picking the right one. I don't know my exact income down to the dollar, even the thousand dollars, in part because it varies somewhat from year to year. But even with that variation it only occasionally crosses a typical bin dividing line.

            That said, the bins typically offered to elicit accurate responses like that rarely correspond to pre-specified quantiles of the variable's distribution. And specifically with regard to income, I would be willing to bet that few people if asked what percentile of the income distribution they lie at could say anything other than "I don't know." And of those that did respond, I'd wager that most would be quite incorrect.



            • #7
              I agree with all that Clyde has written. I would never elicit a survey respondent's estimate of the family income as a quantile of the population distribution. You've got to ask something that the respondent has some chance of knowing and remembering - but then you need to acknowledge the likely problems with their recall and reporting, and for that matter, with the role of the data in a model.

              I view the process of binning a continuous value for self-reported income and treating it as categorical rather than continuous as a simple, and perhaps robust, low-tech procedure that deals with the following common problems with self-reported income data.
              • inaccurate and incomplete recall
              • exclusion of non-cash benefits
              • unwillingness to respond - there's a reason that "family income" is usually asked near the end of the questionnaire
              • outliers
              • a non-linear response to income of the variable of interest: e.g. middle-income pursuits unaffordable to those with low incomes and too déclassé for the wealthy
              • for income measured in multiple waves, wave-specific xtiles eliminate adjusting to "constant dollars" (pounds sterling, euros, etc.)
              • use of income as a proxy for socio-economic status, rather than being of interest in and of itself
              • at least binning is not Winsorizing
              I will add that I am continually bemused by the amount of mathematical effort we see put into obtaining econometric estimates of model parameters that take into account every possible violation of modeling assumptions in order to get a theoretically justifiable estimate and its standard error, while - because there's no systematic theory behind it - little to no effort is expended on the likely problems of the underlying data expressed by Stamp's Law.
              Last edited by William Lisowski; 05 Oct 2016, 19:17.



              • #8
                This has been an interesting read. There is no right or wrong, so we shouldn't be surprised at practices in different fields, because I'm sure that each one has its reasons. In the natural sciences, measurements and experiments are easy to perform and most likely much more accurate than in social sciences such as economics, where we use at best observational data, and most often survey data. The nature of the data dictates the processes and directions that the users of such data have to take and "invent" in order to be able to analyze it. A doctor has no need to survey patients about where their blood pressure lies by presenting different bins and having the patient select one. He simply wraps the patient's arm and takes a measurement. Then he has to worry about the machine's margin of error, but that is a totally different problem from having to rely on a person's willingness to recognize how little money he/she makes or what a humongous amount of money he/she makes.
                Alfonso Sanchez-Penalver

