  • Estimation Technique for Regression Equation with Continuous Dependent Variable and Categorical explanatory variable

    Hello Statalist,

    I have used OLS to estimate regression equations in which the dependent variable is continuous (lastpay) and the explanatory variable is categorical (mainactivitysector, with 14 categories). Following Stata's example (mealcat), I created the dummy variables manually so that I could choose which category to omit from the regression (here, the 14th).

    Please, I would like to know whether there is a more suitable technique other than OLS that I can use, because my supervisor is of the opinion that OLS is not feasible in this situation.

    The summary statistics are posted below:

    . summarize zone lastpay mainactivitysector

        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
            zone |      2,559    3.745213    1.362958          1          6
         lastpay |      2,559    17175.74    35435.27        100     800000
    mainactivi~r |      2,559      5.4873    4.411378          1         14


    The sample data is posted below:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float loglastpay byte(m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14)
    10.308952 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     10.44967 0 0 0 1 0 0 0 0 0 0 0 0 0 0
     9.546813 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     7.824046 0 0 1 0 0 0 0 0 0 0 0 0 0 0
     8.006368 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    10.645425 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    10.545341 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     9.798127 0 0 0 0 0 0 0 0 0 0 0 0 0 1
     8.987197 0 0 1 0 0 0 0 0 0 0 0 0 0 0
     10.92953 0 0 0 0 0 0 0 0 0 0 0 1 0 0
      6.39693 0 0 0 0 0 0 0 1 0 0 0 0 0 0
     8.987197 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     9.392662 0 0 0 0 0 0 0 0 0 0 1 0 0 0
    10.165852 0 0 0 0 0 0 0 0 0 0 0 0 0 1
     8.853665 0 0 0 0 0 0 0 0 0 1 0 0 0 0
     9.798127 0 0 0 0 0 1 0 0 0 0 0 0 0 0
     9.903487 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    10.596635 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.392662 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     5.703783 0 0 1 0 0 0 0 0 0 0 0 0 0 0
     7.600903 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.392662 0 0 0 0 0 0 0 0 0 0 0 1 0 0
     7.600903 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.615806 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    10.308952 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     7.824046 0 0 0 0 0 0 1 0 0 0 0 0 0 0
    13.122363 0 0 0 0 0 0 1 0 0 0 0 0 0 0
     6.214608 0 0 0 0 0 0 0 0 0 1 0 0 0 0
     8.294049 0 0 1 0 0 0 0 0 0 0 0 0 0 0
     9.740969 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.305651 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     8.699514 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     8.006368 0 0 1 0 0 0 0 0 0 0 0 0 0 0
     7.600903 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    10.463103 0 0 0 0 0 0 0 0 0 0 0 0 1 0
      6.39693 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     8.006368 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.615806 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     8.006368 0 1 0 0 0 0 0 0 0 0 0 0 0 0
    10.545341 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    10.714417 0 0 0 0 0 0 0 1 0 0 0 0 0 0
    9.2103405 0 0 0 0 0 0 0 0 0 0 1 0 0 0
    11.225244 0 1 0 0 0 0 0 0 0 0 0 0 0 0
     9.648595 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     10.37349 0 0 0 0 0 0 0 0 0 0 1 0 0 0
      9.62905 0 0 0 0 0 0 0 0 0 0 1 0 0 0
      6.39693 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     10.13579 0 0 0 0 0 1 0 0 0 0 0 0 0 0
     10.65834 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     9.546813 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     9.998797 0 0 0 0 0 0 0 0 0 0 0 0 0 1
     10.12663 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     8.612503 0 0 0 0 0 0 0 1 0 0 0 0 0 0
     7.313221 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    10.308952 0 0 0 0 0 1 0 0 0 0 0 0 0 0
     10.91509 0 0 0 0 0 0 0 0 0 1 0 0 0 0
     5.298317 0 0 0 0 0 0 0 1 0 0 0 0 0 0
     9.798127 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     10.37349 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    10.308952 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     8.517193 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.615806 0 0 0 0 0 0 0 0 0 0 0 1 0 0
     6.109248 0 0 0 0 0 0 0 1 0 0 0 0 0 0
    11.198215 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     9.126959 0 0 0 0 0 0 0 0 0 0 1 0 0 0
    10.714417 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     8.987197 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.615806 0 0 0 0 0 0 0 0 0 0 0 0 0 1
     8.699514 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    9.2103405 0 0 0 0 0 0 0 0 0 1 0 0 0 0
     8.699514 0 0 0 0 0 0 0 0 0 0 0 0 0 1
     12.36734 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     9.615806 0 0 0 0 0 0 0 0 0 0 0 1 0 0
     9.903487 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     10.37349 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     9.903487 0 0 0 0 0 0 0 0 0 0 0 0 0 1
     11.83138 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.903487 0 0 0 0 0 1 0 0 0 0 0 0 0 0
    10.714417 0 0 0 0 0 0 0 0 0 0 0 1 0 0
     9.230143 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     10.12663 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    10.491274 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     6.907755 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    10.505068 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     8.987197 0 0 0 0 0 0 0 0 0 0 0 1 0 0
     8.294049 0 0 0 0 0 0 1 0 0 0 0 0 0 0
     10.37349 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     9.392662 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     6.907755 1 0 0 0 0 0 0 0 0 0 0 0 0 0
    11.082143 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    10.596635 0 0 0 1 0 0 0 0 0 0 0 0 0 0
     9.838736 0 0 0 0 0 0 0 0 0 0 0 1 0 0
     8.987197 0 0 0 0 0 0 0 0 0 0 1 0 0 0
     9.546813 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.392662 0 0 0 0 0 0 0 0 0 0 0 0 1 0
    9.2103405 0 0 0 0 0 0 0 0 0 0 1 0 0 0
      7.17012 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    11.512925 1 0 0 0 0 0 0 0 0 0 0 0 0 0
     9.615806 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     9.047821 0 0 0 0 0 0 0 0 0 0 0 0 0 1
    end
    Kindly advise on a more suitable estimation technique, if any.

    Thank you.

  • #2
    Florence:
    Your data excerpt does not match your code.
    That said, it is not clear why your supervisor thinks that OLS is not the way to go in this case.
    Does her/his concern relate to standard errors? Or to something else?
    Please help interested listers help you out. Thanks.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      In addition to Carlo's very relevant answer, please could you show a histogram of your dependent variable? And overlay a normal distribution on it? It seems like Poisson might be appropriate.
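
      A minimal sketch of the requested graph, for anyone following along. The toy data, the seed, and the output file name are assumptions made only so that the snippet runs on its own; with the real data, skip the data-generation lines. A .gph file can only be opened in Stata, so exporting to PNG is one way to get an image that can be attached to a post.

      ```stata
      * Toy data standing in for the real file (illustrative assumption)
      clear
      set seed 2024
      set obs 600
      generate byte zone = runiformint(1, 6)
      generate double loglastpay = rnormal(9.5, 1.3)

      * Histogram of the dependent variable with a normal density overlaid, one panel per zone
      histogram loglastpay, normal by(zone)

      * Save the graph as a PNG image (hypothetical file name)
      graph export "hist_loglastpay_by_zone.png", replace
      ```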

      • #4
        Thanks Carlo and Maxence.

        Here is the code I used for the regression:

        Code:
        estimates clear
        levelsof zone, local(zones)
        foreach zone of local zones {
            eststo: regress loglastpay m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 if zone == `zone'
        }
        esttab est* using urbwave3.csv

        Regarding the histogram requested by Maxence, I have made several attempts to copy and paste it here without success. Could you please guide me on how to post the graph here?

        • #5
          I think I have uploaded the histogram now. Graph_histogram_loglastpay_ dependent variable _by zone_statlist.gph

          • #6
            The regression is basically computing the mean of loglastpay for each m type. You could use tabstat to do that, but without the t-tests of each type's difference from the reference group. Unless you have a reference group that is of some import, keep in mind that those tests will be different for each choice of reference m.

            Not sure what the goal is here.
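
            A small sketch of that equivalence, under stated assumptions: the data are simulated, and the variable sector is a hypothetical stand-in for mainactivitysector. The constant should match the mean of the base category, and the constant plus a coefficient should match that category's mean.

            ```stata
            * Toy data; every name and number here is an illustrative assumption
            clear
            set seed 2024
            set obs 500
            generate byte sector = runiformint(1, 14)
            generate double loglastpay = 9 + 0.1*sector + rnormal()

            * OLS on the category dummies, with category 14 as the omitted base
            regress loglastpay ib14.sector

            * Category means, for comparison with the regression output
            tabstat loglastpay, by(sector) statistics(mean n)

            * Constant + coefficient on category 1 reproduces the mean of category 1
            lincom _cons + 1.sector
            ```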

            • #7
              Please, I need help with inserting the histogram here, because I am not sure that what I have attached can be opened.

              Thank you.

              • #8
                Thanks George.

                The goal is to identify determinants of income inequality, focusing on main sector activity. In this instance the predictor, main sector activity, is a categorical variable, while the dependent variable, last payment (income), is continuous. I have used OLS to estimate the regression, but I want to know whether there is a more suitable technique, since the predictors are dummies for the categories of a categorical variable.


                • #9
                  Florence: OLS estimation of a linear model works just fine with dummy categorical variables. In fact, because you only have a set of mutually exclusive and exhaustive dummy variables, it's the same as computing the average within each category, and then taking the differences relative to the base (omitted) category. To get a percentage effect, using the log seems sensible (as the original y is always positive).
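
                  To spell out the percentage reading, here is a short worked formula. The notation is an assumption introduced for illustration: y is lastpay, category 14 is the omitted base, and the bar denotes the within-category sample mean of log y.

                  ```latex
                  \hat\beta_0 = \overline{\log y}_{14},
                  \qquad
                  \hat\beta_j = \overline{\log y}_{j} - \overline{\log y}_{14},
                  \qquad
                  \%\Delta_j = 100\,\bigl(e^{\hat\beta_j} - 1\bigr)
                  ```

                  For example, a coefficient of 0.10 corresponds to about 10.5% higher pay than the base category, and a coefficient of -0.50 to about 39.3% lower pay. Because the means are taken in logs, these are differences in geometric mean pay; for small coefficients, 100 times the coefficient is a close approximation.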

                  • #10
                    All the coefficients are relative to some base (the excluded m type, lest you fall into the dummy variable trap). The coefficients will change as you alter the base (which one is excluded). You can test the others, with some work.

                    That is, you'll have a constant near 10, which is the mean for the base category. The other coefficients will be + or - one or so. The mean for each m-type is the 10 +- the coefficient (the constant + coefficient). When you change the base, all that will change. If you specify no constant, then you're getting the means as coefficients, but the t-stats just test m`i' = 0.

                    Do this:

                    Code:
                    reg loglastpay m* , //noconstant
                    egen groupm = group(m*)
                    tabstat loglastpay , by(groupm)
                    So, I'm curious what you are "testing"? If it's mean differences between all pairs of groups, then the regression alone won't tell you that. It will provide the mean differences between a base and the others. Change the base, and you change the tests.
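
                    One way the "some work" could look, as a hedged sketch on simulated data. The variable sector and the choice of categories 1 and 3 are assumptions for illustration; m1-m14 are created here with tabulate to mimic the posted dummies.

                    ```stata
                    * Toy data; every name and number here is an illustrative assumption
                    clear
                    set seed 2024
                    set obs 500
                    generate byte sector = runiformint(1, 14)
                    generate double loglastpay = 9 + 0.1*sector + rnormal()
                    tabulate sector, generate(m)     // creates the dummies m1-m14

                    * Regression with m14 omitted as the base category
                    regress loglastpay m1-m13

                    * Difference in means between two non-base categories, with its standard error
                    lincom m1 - m3

                    * The same hypothesis as a Wald test
                    test m1 = m3

                    * Joint test that all 14 category means are equal
                    testparm m1-m13
                    ```

                    The results of lincom and test for a given pair do not depend on which category is omitted, so this sidesteps the choice of base.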
                    Last edited by George Ford; 14 May 2024, 20:57.

                    • #11
                      Originally posted by Jeff Wooldridge:
                      Florence: OLS estimation of a linear model works just fine with dummy categorical variables. In fact, because you only have a set of mutually exclusive and exhaustive dummy variables, it's the same as computing the average within each category, and then taking the differences relative to the base (omitted) category. To get a percentage effect, using the log seems sensible (as the original y is always positive).
                      Thank you very much Jeff for the deeper insight into the suitability of OLS for this regression.

                      Please, how do I interpret the results (coefficients) of this regression where the dependent variable is logged?

                      • #12
                        Originally posted by George Ford:
                        All the coefficients are relative to some base (the excluded m type, lest you fall into the dummy variable trap). The coefficients will change as you alter the base (which one is excluded). You can test the others, with some work.

                        That is, you'll have a constant near 10, which is the mean for the base category. The other coefficients will be + or - one or so. The mean for each m-type is the 10 +- the coefficient (the constant + coefficient). When you change the base, all that will change. If you specify no constant, then you're getting the means as coefficients, but the t-stats just test m`i' = 0.

                        Do this:

                        Code:
                        reg loglastpay m* , //noconstant
                        egen groupm = group(m*)
                        tabstat loglastpay , by(groupm)
                        So, I'm curious what you are "testing"? If it's mean differences between all pairs of groups, then the regression alone won't tell you that. It will provide the mean differences between a base and the others. Change the base, and you change the tests.

                        Thank you George. I tried the suggested code with 'noconstant', but the regression returned an error:

                        . reg loglastpay m* , //noconstant
                        option / not allowed
                        r(198);

                        Again, I opted for the manual creation of the dummy categorical variables so that I could decide myself which variable to exclude in the regression.

                        • #13
                          Using factor-variable notation still allows you to choose which category to exclude; see the section on "setting the base level" in
                          Code:
                          h fvvarlist
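
                          A minimal sketch of what that looks like, on simulated data. It assumes the original 14-level variable is named mainactivitysector, as in the summarize output in the first post; the generated values are illustrative only.

                          ```stata
                          * Toy data; the values are an illustrative assumption
                          clear
                          set seed 2024
                          set obs 500
                          generate byte mainactivitysector = runiformint(1, 14)
                          generate double loglastpay = 9 + 0.1*mainactivitysector + rnormal()

                          * Default base: the lowest level (category 1)
                          regress loglastpay i.mainactivitysector

                          * Base set explicitly to category 14
                          regress loglastpay ib14.mainactivitysector

                          * Base set to the highest level, or to the most frequent level
                          regress loglastpay ib(last).mainactivitysector
                          regress loglastpay ib(freq).mainactivitysector
                          ```

                          No hand-made dummies are needed, and changing the base is a one-character edit.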

                          • #14
                            What are you trying to test?

                            • #15
                              Drop the // before noconstant (the // comments it out).
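
                              A sketch of the corrected call on simulated data, with two hedged notes. First, as far as I know, // is treated as a comment in do-files but not when typed in the Command window, which would explain the "option / not allowed" message. Second, in the full dataset the wildcard m* would also match mainactivitysector (it appears in the summarize output in the first post), so this sketch lists m1-m14 explicitly. The variable sector is a hypothetical stand-in used only to build the dummies.

                              ```stata
                              * Toy data; every name and number here is an illustrative assumption
                              clear
                              set seed 2024
                              set obs 500
                              generate byte sector = runiformint(1, 14)
                              generate double loglastpay = 9 + 0.1*sector + rnormal()
                              tabulate sector, generate(m)     // creates the dummies m1-m14

                              * All 14 dummies and no constant: each coefficient is a category mean
                              regress loglastpay m1-m14, noconstant

                              * The same means, computed directly
                              tabstat loglastpay, by(sector) statistics(mean)
                              ```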
