
  • Decile as dependent variable + what should be the right model

    Hi,

    I would like to ask the following:

    My model is initially OLS, with poverty incidence as the dependent variable and a vector of predictors.

    I had been advised to consider transforming the poverty incidence into deciles and using it as the Y. So far, "ordered logit" looks like it could fit this, i.e. 10 categories representing the deciles. However, in the examples I have read, the dependent variable in ologit usually has at most three categories; I haven't seen 10. Is there a limit to the number of categories on the LHS/dependent side?

    Also, are there alternative models where I could use the decile version of a continuous variable as the dependent variable?

    Thank you very much.

  • #2
    I had been advised to consider transforming the poverty incidence into deciles and using it as the Y.
    Can you elaborate on that? To me it sounds like bad advice. There are hardly any circumstances where converting a continuous variable to deciles improves the analysis, and it often seriously degrades the information. What problem are you having that deciles purport to solve? There is probably a better way.



    • #3
      I'm not sure what you mean by "poverty incidence." You should tell us a little more about your data. Assuming that you have the proportion in poverty for some unit of analysis, such as a census tract, where the variable is bounded by zero and one, you probably want to look at fractional regression. See -help fracreg- for details.
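
      For concreteness, a minimal sketch of what that might look like, assuming the proportion is stored as poverty_incidence and the predictors are hypothetical variables x1 and x2:

      Code:
      * fractional regression for a proportion bounded by zero and one
      fracreg logit poverty_incidence x1 x2
      * or the probit variant
      fracreg probit poverty_incidence x1 x2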
      Richard T. Campbell
      Emeritus Professor of Biostatistics and Sociology
      University of Illinois at Chicago



      • #4
        Thank you for the response.

        1. Poverty incidence = the proportion of the population of a certain administrative unit (e.g. a province) below the poverty threshold / poverty line, say, people living on $2 or less a day.

        2. I am testing whether there is a link between poverty incidence and accessibility measures. My initial model is OLS, with 120 observations (cross-section).

        3. "What problem are you having that deciles purport to solve?" -- I was advised to consider other models and check how the results compare to the OLS results, for robustness.

        4. Instead of only saying that poverty incidence increases or decreases with changes in the predictors, we'd also like to consider the "probability" that a province is classified under a certain poverty decile, e.g. least poor.

        I hope that clarifies things. If not, I would be glad to answer follow-up questions, as I myself would like to have a clear direction for my thesis.

        Thank you very much.



        • #5
          With regard to #4, you can simply do:

          Code:
          summ poverty_incidence, detail                    // detail stores percentiles in r(), e.g. r(p10)
          gen byte poorest = poverty_incidence < `r(p10)'   // indicator: below the 10th percentile of poverty_incidence
          logistic poorest ...
          You don't need all ten deciles for that. If you really do want all ten deciles, the -xtile- command will group your data into deciles. See -help xtile-. But it still sounds like a poor solution to me. How do your residual scatterplots look? Is your R2 acceptable for this kind of problem? If it's a good model, I'd stick with it. If it shows appreciable deficiencies, I'd target my efforts on fixing those through suitable transformations of some of your continuous variables (things like log, square root, or cube root, or something else continuous that doesn't throw away information by making them categorical), or by including interaction terms or higher-order terms among the covariates. But #3 as stated still strikes me as bad advice. Maybe somebody else can respond with good ways to implement it and we'll both learn something.
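
          If you do go down the all-ten-deciles route despite the caveats above, a minimal sketch (the predictors x1 and x2 and the name pov_decile are only illustrative) would be:

          Code:
          xtile pov_decile = poverty_incidence, nquantiles(10)   // deciles of poverty incidence: 1 = lowest, 10 = highest
          ologit pov_decile x1 x2                                 // ordered logit with 10 outcome categories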



          • #6
            Originally posted by Clyde Schechter
            With regard to #4, you can simply do:

            Code:
            summ poverty_incidence, detail                    // detail stores percentiles in r(), e.g. r(p10)
            gen byte poorest = poverty_incidence < `r(p10)'   // indicator: below the 10th percentile of poverty_incidence
            logistic poorest ...
            You don't need all ten deciles for that. If you really do want all ten deciles, the -xtile- command will group your data into deciles. See -help xtile-. But it still sounds like a poor solution to me. How do your residual scatterplots look? Is your R2 acceptable for this kind of problem? If it's a good model, I'd stick with it. If it shows appreciable deficiencies, I'd target my efforts on fixing those through suitable transformations of some of your continuous variables (things like log, square root, or cube root, or something else continuous that doesn't throw away information by making them categorical), or by including interaction terms or higher-order terms among the covariates. But #3 as stated still strikes me as bad advice. Maybe somebody else can respond with good ways to implement it and we'll both learn something.
            Hello again,

            I'm sticking with OLS with 1500+ observations.

            1. I am doing post-estimation diagnostics such as predict (xb and r), and I found that my errors are non-normal. Can I assume, based on the CLT, that it is OK to proceed with interpreting my coefficients?

            [screenshot attached]


            2. Also, my Y (the dependent variable, poverty incidence) is not normally distributed. Is it okay, by the CLT, to still proceed? If not, should I resort to a transformation? Would there be any "conceptual" explanation for the suitability of a transformation? Please note that I have run the ladder and gladder commands in Stata and found that the transformation with the lowest chi-squared is the square root. But I am not sure how to interpret the square root of a poverty incidence (proportion of the population below the poverty threshold), what it implies for the interpretation of a significant regression coefficient, or whether it is even appropriate.

            [screenshots attached]
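
            For reference, a rough sketch of the diagnostics described above, assuming the model regresses poverty_incidence on hypothetical predictors x1 and x2 (the other variable names are only illustrative):

            Code:
            regress poverty_incidence x1 x2
            predict xbhat, xb             // linear predictions
            predict resid, residuals      // residuals
            histogram resid, normal       // residuals with a normal density overlay
            qnorm resid                   // normal quantile plot of the residuals
            ladder poverty_incidence      // ladder-of-powers tests for each candidate transformation
            gladder poverty_incidence     // histograms of each candidate transformation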




            • #7
              The degree of departure from normality you show for the error distribution is not a problem in a sample of 1500. The CLT will carry the day for you.

              The non-normality of your Y variable is irrelevant.

              A general comment about these normality-diagnostic tests. They are really not very helpful. In small to moderate samples they are not very sensitive, and in large samples they are not sufficiently specific. OLS linear regression is rather robust to violations of the normality assumption. If your sample is large enough that these tests can find the non-normality in your error distribution, the central limit theorem will rescue the analysis anyway. If your sample is not large enough for the central limit theorem to kick in, these tests will probably fail to detect the departure from normality anyway.

              Nick Cox is fond of saying (I paraphrase here) that performing these normality tests to decide whether you can accept an OLS regression is a bit like sending a rowboat out into choppy waters to see if it is safe for the Queen Mary to sail.



              • #8
                The image belongs to George Box in Biometrika in 1953. His discussion was about testing for unequal variances before comparing means, but the sentiment is similar. There is little point in testing for problems that won't bite.

                I think that looking at marginal distributions is pertinent as you should want to know what about your data might be problematic.

                But I see much worry about getting marginal distributions well behaved when the bigger deal is getting closer to models that will track the pattern in your data, that is the relationship between response and predictors. It's serenditous that (e.g.) a log transformation that might be indicated by very skew distributions often also makes sense as a scale on which to model the data anyway.



                • #9
                  Originally posted by Clyde Schechter
                  The degree of departure from normality you show for the error distribution is not a problem in a sample of 1500. The CLT will carry the day for you.

                  The non-normality of your Y variable is irrelevant.

                  A general comment about these normality-diagnostic tests. They are really not very helpful. In small to moderate samples they are not very sensitive, and in large samples they are not sufficiently specific. OLS linear regression is rather robust to violations of the normality assumption. If your sample is large enough that these tests can find the non-normality in your error distribution, the central limit theorem will rescue the analysis anyway. If your sample is not large enough for the central limit theorem to kick in, these tests will probably fail to detect the departure from normality anyway.

                  Nick Cox is fond of saying (I paraphrase here) that performing these normality tests to decide whether you can accept an OLS regression is a bit like sending a rowboat out into choppy waters to see if it is safe for the Queen Mary to sail.
                  All of your explanations have been enlightening. Thank you.



                  • #10
                    #8 spelling should be "serendipitous"



                    • #11
                      Estimating a linear model by OLS is fine. But putting the variable into deciles and using ordered logit is not fine. First, as Clyde says, it throws away information. Now, sometimes throwing away information and seeing the result is worthwhile, but I don't think so in this case. Why? Two reasons.

                      1. A linear model estimated by OLS can be a good approximation. And it works regardless of the actual distribution of y given x. With y being a fraction, its conditional distribution will be far from normal, but the CLT with N = 1,500 will provide a good approximation for inference. If you use ologit, you'd be assuming the underlying distribution of y is logistic and that there is no heteroskedasticity. Therefore, if you get different answers from the linear model, you can't really know why. Is there something wrong with the linear approximation? Or is the homoskedastic logistic distribution wrong?

                      2. Even if you feel comfortable with the homoskedastic logistic assumption, ologit, as usually implemented, will estimate cut points. But you know the cut points, because you are the one censoring the data. Thus, if you want to pursue this route, you should use -intreg-, which means you have specified the intervals -- they are not estimated from the data. But -intreg- assumes a homoskedastic normal distribution, so see issue (1).

                      You should do what Dick Campbell suggested. This is a perfect place to estimate a fractional logit or fractional probit model. Compute robust standard errors and use -margins- to get the average marginal effects. You can compare these with the OLS coefficients. I suspect they will be pretty close.
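
                      A minimal sketch of that workflow, again with hypothetical predictors x1 and x2:

                      Code:
                      fracreg logit poverty_incidence x1 x2, vce(robust)   // fractional logit with robust standard errors
                      margins, dydx(*)                                     // average marginal effects
                      regress poverty_incidence x1 x2, vce(robust)         // OLS coefficients for comparison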

                      JW



                      • #12
                        Thanks for the great discussion. This is exactly the question I am trying to figure out now. I have another related question. If I am worried that my dependent variable, which is a ratio between 0 and 1, is noisy due to the nature of the measure, could I still do decile-ranking and apply fractional regression?



                        • #13
                          The response being bounded makes no difference to the advice above. Binning data throws away information in an arbitrary way.



                          • #14
                            Originally posted by Nick Cox
                            The response being bounded makes no difference to the advice above. Binning data throws away information in an arbitrary way.
                            Thanks, Nick! I agree that putting the variable into deciles would cause a loss of information. The reason I asked is that, if I use the continuous raw values, my model's reported pseudo R-squared is a bit low, which concerns me. Do you have any suggestions?



                            • #15
                              Consider that if you can reduce bivariate data to two summary points which differ on both variables then the correlation is perfect at 1. More generally: sure, if what you discard is mostly noise, your figures of merit might look better, but that's dubious at best and spurious at worst.

                              If you can give a reason why using deciles is good, but at the same time using, say, quintiles or ventiles is a really bad idea, then you might have a convincing case. Otherwise this all sounds like a fudge to me, but you've not explained exactly what you have in mind in detail, so I might be missing the point.

