How to use Stata to discretize a continuous variable optimally?

BICHENG NIU

Join Date: Aug 2017

Posts: 33
#1

20 Sep 2021, 00:08

Hi guys,

I have just run into an interesting problem. My boss asked me to recode a continuous variable, for example, age.
The final goal is to cut age into intervals that maximize the difference in wage among the intervals.
Is there a user-written command that can do this automatically?

I found some information on the web about algorithms, such as the "chi2 algorithm", that compare the distributions of adjacent intervals and merge those whose differences are trivial.
Is there a similar command in Stata?
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

20 Sep 2021, 00:26

Under very specific conditions, group1d from SSC might help. Announced at https://www.stata.com/statalist/arch.../msg00883.html but read https://stats.stackexchange.com/ques...ets-e-g-income first.

An ugly but programmable way to approach it might be to loop over a series of t tests, or whatever test you prefer.
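The "loop over a series of t tests" idea can be sketched as follows. This is a minimal illustration in Python rather than Stata, and it is not code from any poster: the toy data, the function names `welch_t` and `best_single_cut`, and the choice of a Welch t statistic are all assumptions. It tries each candidate cut point on age and keeps the one with the largest absolute t statistic for wage.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy data: (age, wage) pairs, invented for illustration.
data = [(22, 10.0), (25, 11.0), (28, 12.5), (31, 17.0), (35, 18.5),
        (40, 19.0), (45, 18.0), (50, 17.5)]


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of numbers."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def best_single_cut(pairs, min_size=2):
    """Try a cut at each distinct age; return (cut, |t|) with the largest |t|."""
    ages = sorted({age for age, _ in pairs})
    best = None
    for cut in ages[1:]:
        low = [w for age, w in pairs if age < cut]
        high = [w for age, w in pairs if age >= cut]
        # Skip cuts that leave a group too small to estimate a variance.
        if len(low) < min_size or len(high) < min_size:
            continue
        t = abs(welch_t(low, high))
        if best is None or t > best[1]:
            best = (cut, t)
    return best


cut, t = best_single_cut(data)
print(f"best cut: age >= {cut}, |t| = {t:.2f}")
```

Because the cut is chosen to make the statistic as large as possible, the t statistic at the selected cut should not be read as an ordinary test result.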
BICHENG NIU

Join Date: Aug 2017

Posts: 33
#3

20 Sep 2021, 00:36

Originally posted by Nick Cox View Post

Under very specific conditions, group1d from SSC might help. Announced at https://www.stata.com/statalist/arch.../msg00883.html but read https://stats.stackexchange.com/ques...ets-e-g-income first.

An ugly but programmable way to approach it might be to loop over a series of t tests, or whatever test you prefer.

Thank you, Mr Cox. I tried to write a simple version that pre-specifies a fixed bin width, for example 0-1 1-2 2-3 ... etc, or 0-2 2-4 ... etc. I could then create the full list of intervals and perform an F test to select an optimal plan. But what if the width is not fixed, say 0-1, 1-4, 5-9 ... etc? I'm not clear how to enumerate every possible list of intervals exhaustively. The number of combinations may be very large.
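On the question of listing every plan with varying widths, one possible sketch (in Python for illustration, not Stata; the function name `contiguous_partitions` and the example ages are hypothetical) is to sort the distinct values and choose where to place the k - 1 cut points. Each choice gives one set of non-overlapping, contiguous bins.

```python
from itertools import combinations
from math import comb


def contiguous_partitions(values, k):
    """Yield every split of sorted `values` into k non-empty contiguous bins."""
    n = len(values)
    # A plan is fully described by the k - 1 positions where a new bin starts.
    for cuts in combinations(range(1, n), k - 1):
        edges = (0, *cuts, n)
        yield [values[i:j] for i, j in zip(edges, edges[1:])]


ages = [20, 25, 30, 35, 40]  # hypothetical distinct ages
plans = list(contiguous_partitions(ages, 3))
for plan in plans:
    print(plan)

# With n distinct values there are C(n-1, k-1) plans with exactly k bins,
# and 2**(n-1) plans over all possible k.
total = sum(comb(len(ages) - 1, k - 1) for k in range(1, len(ages) + 1))
print(len(plans), total)
```

The count doubles with every additional distinct value, so with 50 distinct ages there would be 2**49 (about 5.6e14) plans. Exhaustive enumeration is therefore only practical for small problems.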
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#4

20 Sep 2021, 01:41

Originally posted by BICHENG NIU View Post

The final goal is to cut age into intervals that maximize the difference in wage among the intervals.

Are you sure about that? If you want to do that, just make the intervals so small that any pattern you see will be dominated by random noise. That will make the differences between intervals big...

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

20 Sep 2021, 01:45

Overlapping intervals don't make sense to me. Otherwise, the problem is indeed wide open without some rules on exactly what you want.
BICHENG NIU

Join Date: Aug 2017

Posts: 33
#6

20 Sep 2021, 09:09

Originally posted by Nick Cox View Post

Overlapping intervals don't make sense to me. Otherwise, the problem is indeed wide open without some rules on exactly what you want.

Um... you may have misunderstood me. I'm not going to generate overlapping bins; I mean something like 1-3, 4-8, 9-15, ... The width of the bins may vary, but they never overlap.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

21 Sep 2021, 01:15

In #3 your examples

Code:

0-1 1-2 2-3 ... etc, 0-2 2-4...etc. .... 0-1, 1-4, 5-9

all include overlapping intervals as you've written them. Good to hear that you didn't mean what you said.

Otherwise this is a problem without non-trivial solutions unless you express some criteria. As Maarten Buis hints, to maximise differences between intervals and minimise differences within intervals, you can't improve on using the distinct observed values as their own intervals.

More constructively, I have made one specific suggestion -- use group1d -- which you haven't commented on.
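The point about distinct observed values being unbeatable as their own intervals can be checked numerically. The sketch below is in Python for illustration, with invented wages and hypothetical helper names `between_ss` and `equal_bins`. It computes the between-group sum of squares as a share of the total for successively finer nested bins.

```python
from statistics import mean

# Hypothetical wages, sorted by age, one observation per distinct age.
wages = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 16.0, 17.0]


def between_ss(groups):
    """Between-group sum of squares: sum of n_g * (group mean - grand mean)^2."""
    grand = mean([w for g in groups for w in g])
    return sum(len(g) * (mean(g) - grand) ** 2 for g in groups)


def equal_bins(xs, size):
    """Split xs into consecutive bins of `size` observations each."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]


total_ss = sum((w - mean(wages)) ** 2 for w in wages)

shares = []
for size in (8, 4, 2, 1):  # each binning refines the previous one
    share = between_ss(equal_bins(wages, size)) / total_ss
    shares.append(share)
    print(f"bin size {size}: between-group share = {share:.3f}")
```

The share never falls as the bins are refined and reaches 1 when every observation is its own bin, so "maximize between-group variation" alone does not select a useful binning. Some further criterion, such as a fixed number of bins, is needed.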
BICHENG NIU

Join Date: Aug 2017

Posts: 33
#8

21 Sep 2021, 19:03

Originally posted by Nick Cox View Post

In #3 your examples

Code:

0-1 1-2 2-3 ... etc, 0-2 2-4...etc. .... 0-1, 1-4, 5-9

all include overlapping intervals as you've written them. Good to hear that you didn't mean what you said.

Otherwise this is a problem without non-trivial solutions unless you express some criteria. As Maarten Buis hints, to maximise differences between intervals and minimise differences within intervals, you can't improve on using the distinct observed values as their own intervals.

More constructively, I have made one specific suggestion -- use group1d -- which you haven't commented on.

Hi Mr Cox, I read your suggestions on the group1d package and the related materials. As I understand it, the principle behind group1d is something like an "unsupervised learning" approach (to borrow a term from machine learning): it groups values based on their relative locations. My problem is essentially "supervised learning": I need another variable (y) to group x.

Thanks for your reply~
BICHENG NIU

Join Date: Aug 2017

Posts: 33
#9

21 Sep 2021, 19:11

Originally posted by Nick Cox View Post

In #3 your examples

Code:

0-1 1-2 2-3 ... etc, 0-2 2-4...etc. .... 0-1, 1-4, 5-9

all include overlapping intervals as you've written them. Good to hear that you didn't mean what you said.

Otherwise this is a problem without non-trivial solutions unless you express some criteria. As Maarten Buis hints, to maximise differences between intervals and minimise differences within intervals, you can't improve on using the distinct observed values as their own intervals.

More constructively, I have made one specific suggestion -- use group1d -- which you haven't commented on.

Let me give another example to clarify. A common pattern of wage with respect to age is that wage goes up and then comes down as age increases. As age and wage are both continuous variables, suppose I want to discretize the age variable (find a set of optimal bins) to maximize the between-group variation of wage across the age bins.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#10

21 Sep 2021, 19:13

I don't see it that way. If you first reduce your data to (age, mean wage) then the problem addressed is binning ages according to mean wage. This is addressed in the Cross Validated thread. The original applications were to splitting time series, but time only plays the role of defining intervals.
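The two steps described here, reducing to (age, mean wage) and then binning ages by mean wage, can be sketched as below. This is an illustration in Python, not Stata and not the group1d source: the data are invented, the names `sse` and `optimal_bins` are hypothetical, and using a least-squares criterion solved by dynamic programming for a fixed number of groups is an assumption about the kind of method meant.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (age, wage) observations, invented for illustration.
obs = [(25, 10), (25, 12), (30, 11), (30, 13), (35, 19), (35, 21),
       (40, 20), (40, 22), (45, 15), (45, 17)]

# Step 1: reduce the data to one mean wage per distinct age.
by_age = defaultdict(list)
for age, wage in obs:
    by_age[age].append(wage)
ages = sorted(by_age)
y = [mean(by_age[a]) for a in ages]


def sse(i, j):
    """Sum of squared deviations of y[i:j] from its own mean."""
    seg = y[i:j]
    m = mean(seg)
    return sum((v - m) ** 2 for v in seg)


def optimal_bins(k):
    """Split ages into k contiguous bins minimising within-bin SSE of mean wage."""
    n = len(y)
    inf = float("inf")
    # cost[g][j]: smallest SSE for the first j ages using g bins.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):  # last bin is ages[i:j]
                c = cost[g - 1][i] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j], back[g][j] = c, i
    # Step 2: walk back through the stored split points to recover the bins.
    bins, j = [], n
    for g in range(k, 0, -1):
        i = back[g][j]
        bins.append(ages[i:j])
        j = i
    return bins[::-1], cost[k][n]


bins, total_sse = optimal_bins(3)
print(bins, total_sse)
```

The number of bins k still has to be supplied by the user. Minimising within-bin variation for a fixed k is what keeps the problem from collapsing to one bin per distinct age.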

That said, age and wage is perhaps the least convincing application of this method I've heard about. I am not an economist, but it seems utterly standard that (a) age and wage data are very noisy given all the other predictors that influence wage (b) as a rough empiricism the mean wage varies fairly smoothly with age (often quadratics are used to inject some curvature). That being so, binning noisy and continuous data is unlikely to be especially successful. However, you have told us nothing about your data, which may be confidential, so this is just speculation.

EDIT: Crossed with #9 but there is some overlap. If your perception is of continuity, why expect or even seek "optimal bins"?

Last edited by Nick Cox; 21 Sep 2021, 19:15.