  • How to convert a categorical variable to a continuous variable by assigning midpoints?

    Hi Statalisters,

    I am looking at the American Time Use Survey dataset, where the variable for household income is categorical and has 16 categories (as shown in the tables below). I would like to convert this to a continuous variable, similar to what Zilanawala (2014, pp. 10-11) does. Here is what she says she has done:

    “Income is converted from these categorical responses to dollar amounts by assigning the midpoint of each category and representing income in thousands of dollars. The last category is topcoded to $200,000”


    Code:
    . ta hefaminc

          Edited: Family |
                  Income |      Freq.     Percent        Cum.
    ---------------------+-----------------------------------
        Less than $5,000 |        303        3.36        3.36
        $5,000 to $7,499 |        173        1.92        5.28
        $7,500 to $9,999 |        245        2.72        8.00
      $10,000 to $12,499 |        289        3.21       11.21
      $12,500 to $14,999 |        258        2.86       14.07
      $15,000 to $19,999 |        445        4.94       19.01
      $20,000 to $24,999 |        472        5.24       24.25
      $25,000 to $29,999 |        485        5.38       29.64
      $30,000 to $34,999 |        531        5.89       35.53
      $35,000 to $39,999 |        475        5.27       40.80
      $40,000 to $49,999 |        779        8.65       49.45
      $50,000 to $59,999 |        717        7.96       57.41
      $60,000 to $74,999 |        920       10.21       67.62
      $75,000 to $99,999 |      1,113       12.35       79.98
    $100,000 to $149,999 |      1,039       11.53       91.51
       $150,000 and over |        765        8.49      100.00
    ---------------------+-----------------------------------
                   Total |      9,009      100.00

    Code:
    . ta hefaminc, nol

        Edited: |
         Family |
         Income |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        303        3.36        3.36
              2 |        173        1.92        5.28
              3 |        245        2.72        8.00
              4 |        289        3.21       11.21
              5 |        258        2.86       14.07
              6 |        445        4.94       19.01
              7 |        472        5.24       24.25
              8 |        485        5.38       29.64
              9 |        531        5.89       35.53
             10 |        475        5.27       40.80
             11 |        779        8.65       49.45
             12 |        717        7.96       57.41
             13 |        920       10.21       67.62
             14 |      1,113       12.35       79.98
             15 |      1,039       11.53       91.51
             16 |        765        8.49      100.00
    ------------+-----------------------------------


    The only way I can think of doing this right now is something like the following:

    recode hefaminc (1=2500) (2=6250) and so on.

    1. However, I am not sure how to generate the midpoint of a category like (5000-7499). Should I add the two endpoints, 5000 and 7499 and divide them by 2, or is there some other formula?
    2. Secondly I was wondering whether there is a more elegant way of doing this, rather than generating each of the midpoints individually


    Thanks in advance!

    Monzur


    Reference:
    Zilanawala, A. (2014). Women’s Time Poverty and Family Structure Differences by Parenthood and Employment. Journal of Family Issues, 0192513X14542432.

  • #2
    Originally posted by Monzur Alam View Post
    1. However, I am not sure how to generate the midpoint of a category like (5000-7499). Should I add the two endpoints, 5000 and 7499 and divide them by 2, or is there some other formula?
    \[
    \frac{7499+5000}{2} = \frac{7499-5000}{2}+5000 \approx 6250
    \]

    You could give it the value 6249.5, but I would consider that false precision.

    Originally posted by Monzur Alam View Post
    2. Secondly I was wondering whether there is a more elegant way of doing this, rather than generating each of the midpoints individually
    Not easily with unequal bin widths. Even if a trick were possible it would just make your .do file harder to read, so I would recommend against it. What typically helps is to add line breaks in your do file, something like:
    Code:
    recode hefaminc ( 1 =  2500 ) ///
                    ( 2 =  6250 ) ///
                    ( 3 =  8750 ) ///
                    ( 4 = 11250 )
    I would also urge you to think a bit more about the first category: less than 5000 is not the same as between 0 and 5000. Think of small business owners or farmers in a bad year: their income could be negative.
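    For reference, here is a sketch of the full rule set in that style; faminc_mid and faminc_k are just illustrative new variable names, and the values for the open-ended first and last categories (2,500 and the 200,000 topcode from the quoted paper) are exactly the arbitrary choices discussed in this thread, not values implied by the data.
    Code:
    recode hefaminc (1 =   2500) (2 =   6250) (3 =   8750) (4 =  11250) ///
                    (5 =  13750) (6 =  17500) (7 =  22500) (8 =  27500) ///
                    (9 =  32500) (10 = 37500) (11 = 45000) (12 = 55000) ///
                    (13 = 67500) (14 = 87500) (15 = 125000) (16 = 200000), gen(faminc_mid)
    gen double faminc_k = faminc_mid/1000    // in thousands of dollars, as in Zilanawala (2014)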
    Last edited by Maarten Buis; 10 Dec 2014, 02:12.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Maarten gives you excellent advice in response to your question. However, I would suggest that you think seriously about this imputation procedure (that's what it is). It is surely wrong for almost all respondents (a problem; "false precision" again), and yet what would you actually gain in regression modelling by having an apparently continuous variable as predictor? Why not simply leave the income categories 'as is', and enter the "income" categories using factor variable notation? Just because someone managed to get a paper published using this imputation procedure doesn't make it a good one. There may be other reasons for the recoding (e.g. you're not using income as a predictor as I suggest, but to do other things), but I would want to see them spelled out and justified.
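      To make the factor-variable suggestion concrete, here is a minimal sketch; y stands for whatever outcome is being modelled, and any other covariates are omitted.
      Code:
      regress y ib1.hefaminc      // 15 income dummies, with "Less than $5,000" as the base
      testparm i.hefaminc         // joint test of the income dummies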



      • #4
        If this is used as a predictor, then not only will it be surely wrong for most respondents, but I would also be seriously concerned about the estimated standard error of such a predictor. I would expect the variance to be underestimated by quite a large amount. [On second thought, it might instead blow the s.e. up, because of the small observed variance ... I am not definite on this, but I am sure that one should somehow reflect the fact, that these values are not actually observed, but, as Stephen nails it, are indeed imputed.]

        If you intend to use this as your response/outcome/dependent variable, then you might want to consider interval regression, or an ordered logit (or probit) model.
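        If income were the outcome, a sketch of the interval-regression setup could look like the following, using the category limits shown in #1; agegrp and educ are hypothetical placeholder covariates.
        Code:
        * interval bounds implied by the 16 categories; the missing lower bound for
        * category 1 and missing upper bound for category 16 make -intreg- treat
        * them as left- and right-censored
        recode hefaminc (1=.)      (2=5000)   (3=7500)   (4=10000)  (5=12500)   ///
                        (6=15000)  (7=20000)  (8=25000)  (9=30000)  (10=35000)  ///
                        (11=40000) (12=50000) (13=60000) (14=75000) (15=100000) ///
                        (16=150000), gen(inc_ll)
        recode hefaminc (1=4999)   (2=7499)   (3=9999)   (4=12499)  (5=14999)   ///
                        (6=19999)  (7=24999)  (8=29999)  (9=34999)  (10=39999)  ///
                        (11=49999) (12=59999) (13=74999) (14=99999) (15=149999) ///
                        (16=.), gen(inc_ul)
        intreg inc_ll inc_ul i.agegrp i.educ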

        Best
        Daniel
        Last edited by daniel klein; 10 Dec 2014, 02:55.



        • #5
          Just to pile it on: This kind of procedure is arbitrary beyond belief. 200,000? Why not 250,000? Think of the outliers you are creating and putting in arbitrary places. That is a recipe for arbitrary model fits.



          • #6
            Originally posted by daniel klein View Post
            If this is used as a predictor, then not only will it be surely wrong for most respondents, but I would also be seriously concerned about the estimated standard error of such a predictor. I would expect the variance to be underestimated by quite a large amount. [On second thought, it might instead blow the s.e. up, because of the small observed variance ... I am not definite on this, but I am sure that one should somehow reflect the fact, that these values are not actually observed, but, as Stephen nails it, are indeed imputed.]
            As you can see in the simulation below, if the binning is fine-grained enough you get a fairly good approximation and the test statistics aren't off. It is only with wide bins that the test statistics go wrong (you reject a true null hypothesis too often). This simulation is deliberately well behaved: it is clear what the midpoint is for each bin, including the first and the last. That is typically not true of the data presented in the question, and that could lead to its own set of problems, as Nick and I already warned.

            Code:
            clear all
            set seed 123456
            program define sim, rclass
                // create some data
                drop _all
                set obs 1000
                gen float x1 = rnormal()
                gen byte  x2 = runiform() < .5
                gen float y  = -1 + .5*x1 -2*x2 + rnormal(0,4)
                
                // original x1
                reg y x1 x2
                return scalar p = 2*ttail(e(df_r),abs(_b[x1] - 0.5)/_se[x1])
                
                // binned with width 0.5
                gen x1binned = floor(x1*2)
                gen x1imp = .25+.5*x1binned
                reg y x1imp x2
                return scalar phalf = 2*ttail(e(df_r),abs(_b[x1imp] - 0.5)/_se[x1imp])
                
                // binned with width 1
                drop x1binned x1imp
                gen x1binned = floor(x1)
                gen x1imp = .5+1*x1binned
                reg y x1imp x2
                return scalar pone = 2*ttail(e(df_r),abs(_b[x1imp] - 0.5)/_se[x1imp])
                
                // binned with width 1.5
                drop x1binned x1imp
                gen x1binned = floor(x1*2/3)
                gen x1imp = .75+1.5*x1binned
                reg y x1imp x2
                return scalar ponehalf = 2*ttail(e(df_r),abs(_b[x1imp] - 0.5)/_se[x1imp])
            end
            
            simulate p=r(p) phalf=r(phalf) pone=r(pone) ponehalf=r(ponehalf) , ///
                reps(20000): sim
            
             simpplot p*,                                 ///
                 overall reps(10000)                      ///
                 scheme(s2color) ylab(,angle(horizontal))
            [Attachment: Graph.png (simpplot output for the simulated p-values)]
            This simulation requires the simpplot package, which can be installed from SSC by typing ssc install simpplot in Stata.
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------



            • #7
              Maarten,

              thanks a lot for taking the time to investigate this empirically. These results are already informative. And thanks especially for posting the code, which makes it possible to vary other factors and see how that affects the results. One could, for example, insert a skewed distribution instead of a normal one (as might be the case for income), let x1 and x2 be correlated (as would typically be the case in real-life data), and so on; a sketch of one such modification follows.
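              As a sketch only, with arbitrary parameter values, the data-generation lines inside -sim- could be replaced by something like this (a lognormal, hence skewed, x1 that is shifted by x2, so the two are correlated):
              Code:
              gen byte  x2 = runiform() < .5
              gen float x1 = exp(rnormal(0, .5)) + .3*x2   // skewed and correlated with x2
              gen float y  = -1 + .5*x1 - 2*x2 + rnormal(0,4)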

              Best
              Daniel



              • #8
                Thank you very much, Maarten, for your very thorough explanation. And thank you, everyone else, for your suggestions; this is really helpful. I really should reconsider imputing the data.



                • #9
                  Originally posted by Stephen Jenkins View Post
                  Maarten gives you excellent advice in response to your question. However, I would suggest that you think seriously about this imputation procedure (that's what it is). It is surely wrong for almost all respondents (a problem; "false precision" again), and yet what would you actually gain in regression modelling by having an apparently continuous variable as predictor? Why not simply leave the income categories 'as is', and enter the "income" categories using factor variable notation? Just because someone managed to get a paper published using this imputation procedure doesn't make it a good one. There may be other reasons for the recoding (e.g. you're not using income as a predictor as I suggest, but to do other things), but I would want to see them spelled out and justified.

                  Thank you for your suggestion, Stephen. I am using income as a predictor of self-rated health. However, I wasn't quite sure what you meant by entering "income" categories using factor variable notation. Did you mean creating dummy variables for all 16 categories (and using one of them as the reference)? Also, don't 15 income categories seem too many for one regression model? Apologies if I sound naive; I am somewhat new to Stata and statistics.



                  • #10
                    I did mean using a full set of binary/dummy variables in your regression model. But if you have the "income" variable already defined as a categorical variable, you don't have to actually create the dummy variables: help fvvarlist. Using factor variables helps avoid potential errors in variable construction and can have other pay-offs, e.g. when calculating marginal effects. Having 16 categories isn't a big problem, assuming you have a relatively large sample (which is what I would expect from the American Time Use Sample ... unless you are looking at particular subgroups). For precision, the issue is degrees of freedom, not sample size per se. As it happens, after initial exploratory regressions, you might combine some categories ... but that's further down the track.
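                    As a hedged sketch of how this might look in practice (srhealth and agegrp are hypothetical placeholder names, and the regrouping cut-points are purely illustrative):
                    Code:
                    regress srhealth ib1.hefaminc i.agegrp   // income entered via factor-variable notation
                    margins hefaminc                         // adjusted predictions at each income category
                    testparm i.hefaminc                      // joint test of the 15 income dummies
                    * after exploratory runs, adjacent categories could be combined, e.g.
                    recode hefaminc (1/3=1) (4/6=2) (7/9=3) (10/12=4) (13/14=5) (15/16=6), gen(incgrp)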



                    • #11
                      First off, I hate hate hate variables like this, but unfortunately they are pretty common.

                      But second, there have been papers written on dealing with such problems. Michael Hout, a pretty prominent sociologist/demographer/methodologist from Berkeley wrote "Getting the Most Out of the GSS Income Measures" (GSS = General Social Survey). See

                      http://publicdata.norc.org:41000/gss...20Measures.pdf

                      He says "The midpoints of the closed intervals are appropriate scores for those categories." The open ended intervals are more of a problem though. He suggests trying different strategies. He also suggests including a dummy variable for the top coded category, with the goal being to make it become statistically insignificant.

                      I can see just treating the income variable as categorical. But, that is much less parsimonious; and it seems to be wasting information about the values that you know fall within the interval. It is also harder to interpret than the effects of a continuous variable.

                      I think Maarten's simulations show that there is hope for the midpoint strategy. I would rather try to find ways to deal with the problematic intervals than to just abandon the strategy altogether.
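                      For concreteness, one reading of Hout's dummy-variable suggestion, as a sketch only (faminc_mid is the midpoint variable sketched in #2, and srhealth is a hypothetical name for the outcome):
                      Code:
                      gen byte topcat = hefaminc == 16        // indicator for the top-coded category
                      regress srhealth c.faminc_mid topcat
                      * a topcat coefficient indistinguishable from zero would suggest that the
                      * choice of topcode value is not driving the results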
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam



                      • #12
                        Originally posted by Stephen Jenkins View Post
                        I did mean using a full set of binary/dummy variables in your regression model. But if you have the "income" variable already defined as a categorical variable, you don't have to actually create the dummy variables: help fvvarlist. Using factor variables helps avoid potential errors in variable construction and can have other pay-offs, e.g. when calculating marginal effects. Having 16 categories isn't a big problem, assuming you have a relatively large sample (which is what I would expect from the American Time Use Sample ... unless you are looking at particular subgroups). For precision, the issue is degrees of freedom, not sample size per se. As it happens, after initial exploratory regressions, you might combine some categories ... but that's further down the track.
                        Many thanks, Stephen! I have managed to use the factor variable notation. Regarding the sample size, it seems reasonably large (around 7,500); I am looking at a subgroup of adult women from the last two survey rounds.

                        Thank you again.



                        • #13
                          Originally posted by Richard Williams View Post
                          First off, I hate hate hate variables like this, but unfortunately they are pretty common.

                          But second, there have been papers written on dealing with such problems. Michael Hout, a pretty prominent sociologist/demographer/methodologist from Berkeley wrote "Getting the Most Out of the GSS Income Measures" (GSS = General Social Survey). See

                          http://publicdata.norc.org:41000/gss...20Measures.pdf

                          He says "The midpoints of the closed intervals are appropriate scores for those categories." The open ended intervals are more of a problem though. He suggests trying different strategies. He also suggests including a dummy variable for the top coded category, with the goal being to make it become statistically insignificant.

                          I can see just treating the income variable as categorical. But, that is much less parsimonious; and it seems to be wasting information about the values that you know fall within the interval. It is also harder to interpret than the effects of a continuous variable.

                          I think Maarten's simulations show that there is hope for the midpoint strategy. I would rather try to find ways to deal with the problematic intervals than to just abandon the strategy altogether.
                          Thanks, Richard. I hadn't noticed this post before. I will look into Michael Hout's paper.



                          • #14
                            I both agree and disagree with Richard Williams. I agree that Michael Hout is a terrific quantitative sociologist. However, I disagree that his GSS note provides support for the midpoint imputation strategy. The sentence cited is made without any supporting evidence; it is simply a claim. It does not consider the issues that have been raised by Maarten, Daniel, and me. Also, Rich, I simply don't understand your remark about "wasting information about the values that you know fall within the interval". The point is that the only information you have is that the respondent's income value lies within the stated interval. Assuming the value is at a particular point within the interval is an extra step; an imputation. And, of course, it is harder to pick a value if the interval is open-ended. I also disagree that interpretation of covariate effects is "harder" to any great extent. Indeed one advantage of the categorical approach is that you start from a base model in which income may have a non-linear relationship with the outcome (rather than simply assuming it is linear, as default use of "income" as a continuous predictor would).
                            A more respectable approach to imputation that I would consider is this: fit a parametric model (e.g. Dagum or Singh-Maddala) to the grouped data. Then impute an income for each person using that fitted model, ensuring of course that each imputed value respects the constraint of lying within the respondent's income category boundaries. Repeat the process M times, thereby building M data sets, and then fit the regression model using multiple imputation methods. [For an implementation of this sort of approach, specifically to deal with top-coded income values in the US Current Population Survey, see e.g. Stephen P. Jenkins, Richard V. Burkhauser, Shuaizhang Feng, and Jeff Larrimore, ‘Measuring inequality using censored data: a multiple imputation approach’, Journal of the Royal Statistical Society, Series A, 174 (1), January 2011, 63–81.]
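                            For reference, the standard (Rubin) combining rules behind that last step are
                            \[
                            \bar\theta = \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m,
                            \qquad
                            \widehat{\operatorname{Var}}(\bar\theta) = \frac{1}{M}\sum_{m=1}^{M} W_m \;+\; \left(1+\frac{1}{M}\right)\frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat\theta_m-\bar\theta\bigr)^2 ,
                            \]
                            where \(\hat\theta_m\) and \(W_m\) are the estimate and its sampling variance from the m-th completed data set. The second, between-imputation term captures the extra uncertainty that a single midpoint imputation ignores, which is precisely the standard-error concern raised in #4.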

