
  • How to use Stata to discretize a continuous variable optimally?

    Hi guys,

    I have just run into an interesting problem. My boss asked me to recode a continuous variable, for example, age.
    The final goal is to cut age into intervals that maximize the difference in wage among the intervals.
    Is there any user-written command that can do that automatically?

    I found some information on the web: there are algorithms, such as the "chi2 algorithm", that compare the distributions of adjacent intervals in order to merge intervals that are not worth keeping separate.
    Is there any similar command in Stata?




  • #2
    Under very specific conditions, group1d from SSC might help. Announced at https://www.stata.com/statalist/arch.../msg00883.html but read https://stats.stackexchange.com/ques...ets-e-g-income first.

    An ugly but programmable way to approach it might be to loop over a series of t tests, or whatever test you prefer.
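The "loop over a series of t tests" idea can be illustrated with a minimal sketch. It is written in Python only to show the logic, not as a Stata solution, and it is not taken from any post in this thread. It assumes you start from narrow, non-overlapping age bins and repeatedly merge the adjacent pair whose wages differ least, stopping once every adjacent pair exceeds a threshold. The function names, the threshold `t_min=2.0`, and the toy data are all hypothetical, and the repeated testing is not corrected for multiplicity.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Absolute Welch t statistic for two samples (each needs at least 2 values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else float("inf")

def merge_adjacent(groups, t_min=2.0):
    """Merge the adjacent pair of bins with the smallest |t| until every
    adjacent pair has |t| >= t_min.

    groups: list of (low_age, high_age, wages) in age order.
    t_min: arbitrary illustrative threshold, not a recommended value.
    """
    groups = [(lo, hi, list(w)) for lo, hi, w in groups]
    while len(groups) > 1:
        ts = [welch_t(groups[i][2], groups[i + 1][2]) for i in range(len(groups) - 1)]
        i = min(range(len(ts)), key=ts.__getitem__)
        if ts[i] >= t_min:
            break  # all neighbouring bins now differ by at least t_min
        (lo1, _, w1), (_, hi2, w2) = groups[i], groups[i + 1]
        groups[i:i + 2] = [(lo1, hi2, w1 + w2)]
    return groups

# Made-up starting bins: (low age, high age, wages observed in that bin)
start = [
    (20, 24, [10, 11, 12]),
    (25, 29, [10.5, 11.5, 12.5]),
    (30, 34, [20, 21, 22]),
    (35, 39, [20.5, 21, 22.5]),
]
merged = merge_adjacent(start)
print([(lo, hi) for lo, hi, _ in merged])
```

The result depends heavily on the starting bins and on the threshold, which is one reason the approach is described above as ugly.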



    • #3
      Originally posted by Nick Cox
      Under very specific conditions, group1d from SSC might help. Announced at https://www.stata.com/statalist/arch.../msg00883.html but read https://stats.stackexchange.com/ques...ets-e-g-income first.

      An ugly but programmable way to approach it might be to loop over a series of t tests, or whatever test you prefer.
      Thank you, Mr Cox. I tried to write a simple version that pre-specifies a fixed interval width, for example 0-1 1-2 2-3 ... etc, or 0-2 2-4 ... etc. I could then create the full list of intervals and perform an F test to select an optimal plan. But what if the width is not fixed, for example 0-1, 1-4, 5-9 ... etc? It is not clear to me how to enumerate every possible set of intervals exhaustively. The number of combinations may be very large.
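On the enumeration question: with k distinct sorted ages there are k-1 places where a cut can go, so there are C(k-1, m-1) ways to form m contiguous, non-overlapping bins and 2^(k-1) ways in total. The sketch below, in Python purely to illustrate the logic, lists every split for a given number of bins and picks the one with the largest one-way ANOVA F statistic. The function names and the toy data are hypothetical, and exhaustive search is only feasible for a small number of distinct ages.

```python
from itertools import combinations
from statistics import mean

def partitions(values, n_bins):
    """Yield every split of sorted distinct values into n_bins contiguous,
    non-overlapping bins (each bin is a tuple of values)."""
    k = len(values)
    for cuts in combinations(range(1, k), n_bins - 1):
        edges = (0, *cuts, k)
        yield [tuple(values[a:b]) for a, b in zip(edges, edges[1:])]

def f_statistic(bins, wages_by_age):
    """One-way ANOVA F statistic for wage across the given age bins."""
    groups = [[w for age in b for w in wages_by_age[age]] for b in bins]
    n = sum(len(g) for g in groups)
    grand = mean(w for g in groups for w in g)
    means = [mean(g) for g in groups]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((w - m) ** 2 for g, m in zip(groups, means) for w in g)
    df_between, df_within = len(groups) - 1, n - len(groups)
    if within == 0 or df_within == 0:
        return float("inf")
    return (between / df_between) / (within / df_within)

# Made-up data: wages observed at each age
wages_by_age = {1: [10, 12], 2: [11, 13], 3: [20, 22], 4: [21, 23], 5: [12, 14]}
ages = sorted(wages_by_age)

candidates = list(partitions(ages, 3))
best = max(candidates, key=lambda b: f_statistic(b, wages_by_age))
print(len(candidates), best)
```

Comparing F across different numbers of bins is a separate problem, since the criterion tends to reward finer and finer splits.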



      • #4
        Originally posted by BICHENG NIU
        The final goal is to cut the age into some intervals which maximizing the difference of wage among the intervals.
        Are you sure about that? If you want to do that, just make the intervals so small that any pattern you see will be dominated by random noise. That will make the differences between intervals big...
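A small simulation can illustrate this point. The sketch below (Python, for illustration only) generates wages that are pure noise, unrelated to age, and computes the share of total wage variation that lies between age bins as the bins get narrower. The data, the seed, and the bin widths are all made up. Because each narrower set of bins here is a refinement of the wider one, the between-bin share cannot decrease, even though there is no real pattern to find.

```python
import random
from statistics import mean

def between_share(ages, wages, width):
    """Share of total wage variation that lies between age bins of the given width."""
    bins = {}
    for a, w in zip(ages, wages):
        bins.setdefault(a // width, []).append(w)
    grand = mean(wages)
    total = sum((w - grand) ** 2 for w in wages)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in bins.values())
    return between / total

random.seed(1)  # arbitrary seed so the run is repeatable
ages = [a for a in range(20, 60) for _ in range(5)]   # 5 people at each age 20..59
wages = [random.gauss(100, 15) for _ in ages]         # pure noise, unrelated to age

# Widths 20, 10, 5, 1 give nested bins: each is a refinement of the previous one
shares = {width: between_share(ages, wages, width) for width in (20, 10, 5, 1)}
for width, share in shares.items():
    print(width, round(share, 3))
```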

        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          Overlapping intervals don't make sense to me. Otherwise, the problem is indeed wide open without some rules on exactly what you want.



          • #6
            Originally posted by Nick Cox
            Overlapping intervals don't make sense to me. Otherwise, the problem is indeed wide open without some rules on exactly what you want.
            Um... you may have misunderstood me. I am not going to generate overlapping bins; I mean something like 1-3, 4-8, 9-15, ... The width of the bins may vary, but the bins never overlap.



            • #7
              In #3 your examples

              Code:
               0-1 1-2 2-3 ... etc,  0-2 2-4...etc.  ....  0-1, 1-4, 5-9
              all include overlapping intervals as you've written them. Good to hear that you didn't mean what you said.

              Otherwise this is a problem without non-trivial solutions unless you express some criteria. As Maarten Buis hints, to maximise differences between intervals and minimise differences within intervals, you can't improve on using the distinct observed values as their own intervals.

              More constructively, I have made one specific suggestion -- use group1d -- which you haven't commented on.



              • #8
                Hi Cox, I read your suggestion about the group1d package and the related materials. As I understand it, the principle of the group1d method is something like an "unsupervised learning" approach (to borrow a term from machine learning): it groups the values based on their relative locations. My problem is essentially one of "supervised learning": I need another variable (y) to group x.


                Thanks for your reply.



                • #9
                  Let me give another example to clarify. A common pattern of wage with respect to age is that wage first rises and then falls as age increases. As age and wage are both continuous variables, suppose I want to discretize the age variable (find a set of optimal bins) so as to maximize the between-group variation of wage across the age bins.



                  • #10
                    I don't see it that way. If you first reduce your data to (age, mean wage) then the problem addressed is binning ages according to mean wage. This is addressed in the Cross Validated thread. The original applications were to splitting time series, but time only plays the role of defining intervals.
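To make the "(age, mean wage)" reduction concrete, here is a minimal sketch in Python, for illustration only. It is not group1d's code and makes no claim to match its options or output. It assumes the data have already been reduced to one mean wage per age, in age order, and uses dynamic programming to find the split into a given number of contiguous groups that minimizes the within-group sum of squares of the mean wages. The function name and the toy values are hypothetical, and the sketch ignores how many observations lie behind each mean.

```python
def best_partition(y, n_groups):
    """Split the sequence y into n_groups contiguous groups minimising the
    within-group sum of squares.

    Returns (cost, bounds), where bounds is a list of (start, stop) index
    pairs with stop exclusive.
    """
    n = len(y)
    # Prefix sums of y and y^2 give each segment's sum of squares in O(1)
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):
        """Sum of squared deviations from the mean for y[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(n_groups + 1)]
    back = [[0] * (n + 1) for _ in range(n_groups + 1)]
    cost[0][0] = 0.0
    for g in range(1, n_groups + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = cost[g - 1][i] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j], back[g][j] = c, i

    # Walk the back-pointers from the end to recover the group boundaries
    bounds, j = [], n
    for g in range(n_groups, 0, -1):
        i = back[g][j]
        bounds.append((i, j))
        j = i
    return cost[n_groups][n], bounds[::-1]

# Made-up example: one mean wage per age, rising and then falling
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60]
mean_wage = [10, 11, 18, 19, 20, 19, 12, 11, 10]

cost, bounds = best_partition(mean_wage, 3)
age_bins = [(ages[i], ages[j - 1]) for i, j in bounds]
print(cost, age_bins)
```

The number of groups has to be supplied; the criterion by itself always prefers more groups, which is the point made earlier in the thread.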

                    That said, age and wage is perhaps the least convincing application of this method I've heard about. I am not an economist, but it seems utterly standard that (a) age and wage data are very noisy given all the other predictors that influence wage (b) as a rough empiricism the mean wage varies fairly smoothly with age (often quadratics are used to inject some curvature). That being so, binning noisy and continuous data is unlikely to be especially successful. However, you have told us nothing about your data, which may be confidential, so this is just speculation.

                    EDIT: Crossed with #9 but there is some overlap. If your perception is of continuity, why expect or even seek "optimal bins"?
                    Last edited by Nick Cox; 21 Sep 2021, 19:15.

