  • How to convert a categorical variable to a continuous variable by assigning midpoints?

    Hi Statalisters,

    I am looking at the American Time Use Survey dataset, where the variable for household income is categorical and has 16 categories (as shown in the tables below). I would like to convert this to a continuous variable, similar to what Zilanawala (2014, pp. 10-11) does. Here is what she says she has done:

    “Income is converted from these categorical responses to dollar amounts by assigning the midpoint of each category and representing income in thousands of dollars. The last category is topcoded to $200,000”


    Code:
    . ta hefaminc

          Edited: Family |
                  Income |      Freq.     Percent        Cum.
    ---------------------+-----------------------------------
        Less than $5,000 |        303        3.36        3.36
        $5,000 to $7,499 |        173        1.92        5.28
        $7,500 to $9,999 |        245        2.72        8.00
      $10,000 to $12,499 |        289        3.21       11.21
      $12,500 to $14,999 |        258        2.86       14.07
      $15,000 to $19,999 |        445        4.94       19.01
      $20,000 to $24,999 |        472        5.24       24.25
      $25,000 to $29,999 |        485        5.38       29.64
      $30,000 to $34,999 |        531        5.89       35.53
      $35,000 to $39,999 |        475        5.27       40.80
      $40,000 to $49,999 |        779        8.65       49.45
      $50,000 to $59,999 |        717        7.96       57.41
      $60,000 to $74,999 |        920       10.21       67.62
      $75,000 to $99,999 |      1,113       12.35       79.98
    $100,000 to $149,999 |      1,039       11.53       91.51
       $150,000 and over |        765        8.49      100.00
    ---------------------+-----------------------------------
                   Total |      9,009      100.00

    Code:
    . ta hefaminc, nol

        Edited: |
         Family |
         Income |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        303        3.36        3.36
              2 |        173        1.92        5.28
              3 |        245        2.72        8.00
              4 |        289        3.21       11.21
              5 |        258        2.86       14.07
              6 |        445        4.94       19.01
              7 |        472        5.24       24.25
              8 |        485        5.38       29.64
              9 |        531        5.89       35.53
             10 |        475        5.27       40.80
             11 |        779        8.65       49.45
             12 |        717        7.96       57.41
             13 |        920       10.21       67.62
             14 |      1,113       12.35       79.98
             15 |      1,039       11.53       91.51
             16 |        765        8.49      100.00
    ------------+-----------------------------------


    The only way I can think of doing this right now is something like the following:

    recode hefaminc (1=2500) (2=6250) and so on.

    1. However, I am not sure how to generate the midpoint of a category like (5000-7499). Should I add the two endpoints, 5000 and 7499 and divide them by 2, or is there some other formula?
    2. Secondly I was wondering whether there is a more elegant way of doing this, rather than generating each of the midpoints individually


    Thanks in advance!

    Monzur


    Reference:
    Zilanawala, A. (2014). Women’s Time Poverty and Family Structure Differences by Parenthood and Employment. Journal of Family Issues, 0192513X14542432.

  • #2
    Originally posted by Monzur Alam View Post
    1. However, I am not sure how to generate the midpoint of a category like (5000-7499). Should I add the two endpoints, 5000 and 7499 and divide them by 2, or is there some other formula?
    \[
    \frac{7499+5000}{2} = \frac{7499-5000}{2}+5000 \approx 6250
    \]

    You could give it the value 6249.5, but I would consider that false precision.

    Originally posted by Monzur Alam View Post
    2. Secondly I was wondering whether there is a more elegant way of doing this, rather than generating each of the midpoints individually
    Not easily with unequal bin widths. Even if a trick were possible it would just make your .do file harder to read, so I would recommend against it. What typically helps is to add line breaks in your do file, something like:
    Code:
    recode hefaminc ( 1 =  2500 ) ///
                    ( 2 =  6250 ) ///
                    ( 3 =  8750 ) ///
                    ( 4 = 11250 )
    I would also urge you to think a bit more about the first category: less than 5000 is not the same as between 0 and 5000. Think of small business owners or farmers in a bad year: their income could be negative.
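    For reference, here is a sketch of the full rule set in that style; faminc_mid and faminc_k are just illustrative new variable names, and the values for the open-ended first and last categories (2,500 and the 200,000 topcode from the quoted paper) are exactly the arbitrary choices discussed in this thread, not values implied by the data.
    Code:
    recode hefaminc (1 =   2500) (2 =   6250) (3 =   8750) (4 =  11250) ///
                    (5 =  13750) (6 =  17500) (7 =  22500) (8 =  27500) ///
                    (9 =  32500) (10 = 37500) (11 = 45000) (12 = 55000) ///
                    (13 = 67500) (14 = 87500) (15 = 125000) (16 = 200000), gen(faminc_mid)
    gen double faminc_k = faminc_mid/1000    // in thousands of dollars, as in Zilanawala (2014)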
    Last edited by Maarten Buis; 10 Dec 2014, 02:12.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Maarten gives you excellent advice in response to your question. However, I would suggest that you think seriously about this imputation procedure (that's what it is). It is surely wrong for almost all respondents (a problem; "false precision" again), and yet what would you actually gain in regression modelling by having an apparently continuous variable as predictor? Why not simply leave the income categories 'as is', and enter the "income" categories using factor variable notation? Just because someone managed to get a paper published using this imputation procedure doesn't make it a good one. There may be other reasons for the recoding (e.g. you're not using income as a predictor as I suggest, but to do other things), but I would want to see them spelled out and justified.
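      To make the factor-variable suggestion concrete, here is a minimal sketch; y stands for whatever outcome is being modelled, and any other covariates are omitted.
      Code:
      regress y ib1.hefaminc      // 15 income dummies, with "Less than $5,000" as the base
      testparm i.hefaminc         // joint test of the income dummies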



      • #4
        If this is used as a predictor, then not only will it be surely wrong for most respondents, but I would also be seriously concerned about the estimated standard error of such a predictor. I would expect the variance to be underestimated by quite a large amount. [On second thought, it might instead blow the s.e. up, because of the small observed variance ... I am not definite on this, but I am sure that one should somehow reflect the fact, that these values are not actually observed, but, as Stephen nails it, are indeed imputed.]

        If you intend to use this as your response/outcome/dependent variable, then you might want to consider interval regression, or an ordered logit (or probit) model.
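        If income were the outcome, a sketch of the interval-regression setup could look like the following, using the category limits shown in #1; agegrp and educ are hypothetical placeholder covariates.
        Code:
        * interval bounds implied by the 16 categories; the missing lower bound for
        * category 1 and missing upper bound for category 16 make -intreg- treat
        * them as left- and right-censored
        recode hefaminc (1=.)      (2=5000)   (3=7500)   (4=10000)  (5=12500)   ///
                        (6=15000)  (7=20000)  (8=25000)  (9=30000)  (10=35000)  ///
                        (11=40000) (12=50000) (13=60000) (14=75000) (15=100000) ///
                        (16=150000), gen(inc_ll)
        recode hefaminc (1=4999)   (2=7499)   (3=9999)   (4=12499)  (5=14999)   ///
                        (6=19999)  (7=24999)  (8=29999)  (9=34999)  (10=39999)  ///
                        (11=49999) (12=59999) (13=74999) (14=99999) (15=149999) ///
                        (16=.), gen(inc_ul)
        intreg inc_ll inc_ul i.agegrp i.educ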

        Best
        Daniel
        Last edited by daniel klein; 10 Dec 2014, 02:55.



        • #5
          Just to pile it on: This kind of procedure is arbitrary beyond belief. 200,000? Why not 250,000? Think of the outliers you are creating and putting in arbitrary places. That is a recipe for arbitrary model fits.



          • #6
            Originally posted by daniel klein View Post
            If this is used as a predictor, then not only will it be surely wrong for most respondents, but I would also be seriously concerned about the estimated standard error of such a predictor. I would expect the variance to be underestimated by quite a large amount. [On second thought, it might instead blow the s.e. up, because of the small observed variance ... I am not definite on this, but I am sure that one should somehow reflect the fact, that these values are not actually observed, but, as Stephen nails it, are indeed imputed.]
            As you can see in the simulation below, if the binning is fine-grained enough you get a fairly good approximation and the test statistics aren't off. It is only with wide bins that the test statistics go wrong (you reject a true null hypothesis too often). This simulation is deliberately well behaved: it is clear what the midpoint is for each bin, including the first and the last. That is typically not true of the data presented in the question, and that could lead to its own set of problems, as Nick and I already warned.

            Code:
            clear all
            set seed 123456
            program define sim, rclass
                // create some data
                drop _all
                set obs 1000
                gen float x1 = rnormal()
                gen byte  x2 = runiform() < .5
                gen float y  = -1 + .5*x1 -2*x2 + rnormal(0,4)
                
                // original x1
                reg y x1 x2
                return scalar p = 2*ttail(e(df_r),abs(_b[x1] - 0.5)/_se[x1])
                
                // binned with width 0.5
                gen x1binned = floor(x1*2)
                gen x1imp = .25+.5*x1binned
                reg y x1imp x2
                return scalar phalf = 2*ttail(e(df_r),abs(_b[x1imp] - 0.5)/_se[x1imp])
                
                // binned with width 1
                drop x1binned x1imp
                gen x1binned = floor(x1)
                gen x1imp = .5+1*x1binned
                reg y x1imp x2
                return scalar pone = 2*ttail(e(df_r),abs(_b[x1imp] - 0.5)/_se[x1imp])
                
                // binned with width 1.5
                drop x1binned x1imp
                gen x1binned = floor(x1*2/3)
                gen x1imp = .75+1.5*x1binned
                reg y x1imp x2
                return scalar ponehalf = 2*ttail(e(df_r),abs(_b[x1imp] - 0.5)/_se[x1imp])
            end
            
            simulate p=r(p) phalf=r(phalf) pone=r(pone) ponehalf=r(ponehalf) , ///
                reps(20000): sim
            
             simpplot p*,                                 ///
                 overall reps(10000)                      ///
                 scheme(s2color) ylab(,angle(horizontal))
            [Attachment: Graph.png (simpplot output for the simulated p-values)]
            This simulation requires the simpplot package, which can be installed from SSC by typing ssc install simpplot in Stata.
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------



            • #7
              Maarten,

              thanks a lot for taking the time to investigate this empirically. These results are already informative. And thanks especially for posting the code, which makes it possible to vary other factors and see how that affects the results. One could, for example, insert a skewed distribution instead of a normal one (as might be the case for income), let x1 and x2 be correlated (as would typically be the case in real-life data), and so on; a sketch of one such modification follows.
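              As a sketch only, with arbitrary parameter values, the data-generation lines inside -sim- could be replaced by something like this (a lognormal, hence skewed, x1 that is shifted by x2, so the two are correlated):
              Code:
              gen byte  x2 = runiform() < .5
              gen float x1 = exp(rnormal(0, .5)) + .3*x2   // skewed and correlated with x2
              gen float y  = -1 + .5*x1 - 2*x2 + rnormal(0,4)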

              Best
              Daniel



              • #8
                Thank you very much, Maarten, for your very thorough explanation. And thank you, everyone else, for your suggestions; this is really helpful. I really should reconsider imputing the data.



                • #9
                  Originally posted by Stephen Jenkins View Post
                  Maarten gives you excellent advice in response to your question. However, I would suggest that you think seriously about this imputation procedure (that's what it is). It is surely wrong for almost all respondents (a problem; "false precision" again), and yet what would you actually gain in regression modelling by having an apparently continuous variable as predictor? Why not simply leave the income categories 'as is', and enter the "income" categories using factor variable notation? Just because someone managed to get a paper published using this imputation procedure doesn't make it a good one. There may be other reasons for the recoding (e.g. you're not using income as a predictor as I suggest, but to do other things), but I would want to see them spelled out and justified.

                  Thank you for your suggestion, Stephen. I am using income as a predictor of self-rated health. However, I wasn't quite sure what you meant by entering "income" categories using factor variable notation. Did you mean creating dummy variables for all 16 categories (and using one of them as the reference)? Also, don't 15 income categories seem too many for one regression model? Apologies if I sound naive; I am somewhat new to Stata and statistics.



                  • #10
                    I did mean using a full set of binary/dummy variables in your regression model. But if you have the "income" variable already defined as a categorical variable, you don't have to actually create the dummy variables: help fvvarlist. Using factor variables helps avoid potential errors in variable construction and can have other pay-offs, e.g. when calculating marginal effects. Having 16 categories isn't a big problem, assuming you have a relatively large sample (which is what I would expect from the American Time Use Sample ... unless you are looking at particular subgroups). For precision, the issue is degrees of freedom, not sample size per se. As it happens, after initial exploratory regressions, you might combine some categories ... but that's further down the track.
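                    As a hedged sketch of how this might look in practice (srhealth and agegrp are hypothetical placeholder names, and the regrouping cut-points are purely illustrative):
                    Code:
                    regress srhealth ib1.hefaminc i.agegrp   // income entered via factor-variable notation
                    margins hefaminc                         // adjusted predictions at each income category
                    testparm i.hefaminc                      // joint test of the 15 income dummies
                    * after exploratory runs, adjacent categories could be combined, e.g.
                    recode hefaminc (1/3=1) (4/6=2) (7/9=3) (10/12=4) (13/14=5) (15/16=6), gen(incgrp)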



                    • #11
                      First off, I hate hate hate variables like this, but unfortunately they are pretty common.

                      But second, there have been papers written on dealing with such problems. Michael Hout, a pretty prominent sociologist/demographer/methodologist from Berkeley wrote "Getting the Most Out of the GSS Income Measures" (GSS = General Social Survey). See

                      http://publicdata.norc.org:41000/gss...20Measures.pdf

                      He says "The midpoints of the closed intervals are appropriate scores for those categories." The open ended intervals are more of a problem though. He suggests trying different strategies. He also suggests including a dummy variable for the top coded category, with the goal being to make it become statistically insignificant.

                      I can see just treating the income variable as categorical. But, that is much less parsimonious; and it seems to be wasting information about the values that you know fall within the interval. It is also harder to interpret than the effects of a continuous variable.

                      I think Maarten's simulations show that there is hope for the midpoint strategy. I would rather try to find ways to deal with the problematic intervals than to just abandon the strategy altogether.
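                      For concreteness, one reading of Hout's dummy-variable suggestion, as a sketch only (faminc_mid is the midpoint variable sketched in #2, and srhealth is a hypothetical name for the outcome):
                      Code:
                      gen byte topcat = hefaminc == 16        // indicator for the top-coded category
                      regress srhealth c.faminc_mid topcat
                      * a topcat coefficient indistinguishable from zero would suggest that the
                      * choice of topcode value is not driving the results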
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam



                      • #12
                        Originally posted by Stephen Jenkins View Post
                        I did mean using a full set of binary/dummy variables in your regression model. But if you have the "income" variable already defined as a categorical variable, you don't have to actually create the dummy variables: help fvvarlist. Using factor variables helps avoid potential errors in variable construction and can have other pay-offs, e.g. when calculating marginal effects. Having 16 categories isn't a big problem, assuming you have a relatively large sample (which is what I would expect from the American Time Use Sample ... unless you are looking at particular subgroups). For precision, the issue is degrees of freedom, not sample size per se. As it happens, after initial exploratory regressions, you might combine some categories ... but that's further down the track.
                        Many thanks, Stephen! I have managed to use the factor variable notation. Regarding the sample size, it seems reasonably large (around 7,500); I am looking at a subgroup of adult women from the last two survey rounds.

                        Thank you again.



                        • #13
                          Originally posted by Richard Williams View Post
                          First off, I hate hate hate variables like this, but unfortunately they are pretty common.

                          But second, there have been papers written on dealing with such problems. Michael Hout, a pretty prominent sociologist/demographer/methodologist from Berkeley wrote "Getting the Most Out of the GSS Income Measures" (GSS = General Social Survey). See

                          http://publicdata.norc.org:41000/gss...20Measures.pdf

                          He says "The midpoints of the closed intervals are appropriate scores for those categories." The open ended intervals are more of a problem though. He suggests trying different strategies. He also suggests including a dummy variable for the top coded category, with the goal being to make it become statistically insignificant.

                          I can see just treating the income variable as categorical. But, that is much less parsimonious; and it seems to be wasting information about the values that you know fall within the interval. It is also harder to interpret than the effects of a continuous variable.

                          I think Maarten's simulations show that there is hope for the midpoint strategy. I would rather try to find ways to deal with the problematic intervals than to just abandon the strategy altogether.
                          Thanks, Richard. I hadn't noticed this post before. I will look into Michael Hout's paper.



                          • #14
                            I both agree and disagree with Richard Williams. I agree that Michael Hout is a terrific quantitative sociologist. However, I disagree that his GSS note provides support for the midpoint imputation strategy. The sentence cited is made without any supporting evidence; it is simply a claim. It does not consider the issues that have been raised by Maarten, Daniel, and me. Also, Rich, I simply don't understand your remark about "wasting information about the values that you know fall within the interval". The point is that the only information you have is that the respondent's income value lies within the stated interval. Assuming the value is at a particular point within the interval is an extra step; an imputation. And, of course, it is harder to pick a value if the interval is open-ended. I also disagree that interpretation of covariate effects is "harder" to any great extent. Indeed one advantage of the categorical approach is that you start from a base model in which income may have a non-linear relationship with the outcome (rather than simply assuming it is linear, as default use of "income" as a continuous predictor would).
                            A more respectable approach to imputation that I would consider is this: fit a parametric model (e.g. Dagum or Singh-Maddala) to the grouped data. Then impute an income for each person using that fitted model, ensuring of course that each imputed value respects the constraint of lying within the respondent's income category boundaries. Repeat the process M times, thereby building M data sets, and then fit the regression model using multiple imputation methods. [For an implementation of this sort of approach, specifically to deal with top-coded income values in the US Current Population Survey, see e.g. Stephen P. Jenkins, Richard V. Burkhauser, Shuaizhang Feng, and Jeff Larrimore, ‘Measuring inequality using censored data: a multiple imputation approach’, Journal of the Royal Statistical Society, Series A, 174 (1), January 2011, 63–81.]
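                            For reference, the standard (Rubin) combining rules behind that last step are
                            \[
                            \bar\theta = \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m,
                            \qquad
                            \widehat{\operatorname{Var}}(\bar\theta) = \frac{1}{M}\sum_{m=1}^{M} W_m \;+\; \left(1+\frac{1}{M}\right)\frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat\theta_m-\bar\theta\bigr)^2 ,
                            \]
                            where \(\hat\theta_m\) and \(W_m\) are the estimate and its sampling variance from the m-th completed data set. The second, between-imputation term captures the extra uncertainty that a single midpoint imputation ignores, which is precisely the standard-error concern raised in #4.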

