  • Need Help for Regression with Both Categorical and Numeric Variables

    I am trying to run a regression with various types of variables and I'm not sure it's possible. I would really appreciate it if someone could point me in the right direction.

    I am trying to predict SalePrice of a car (numeric)
    And the predictors are Year (numeric), Metro (categorical), LINDEX, MonthofYear (categorical, but written as numbers), HaveTitle (numeric), Mileage (numeric), Running (numeric), and MC (categorical, but written as numbers)

    I am getting a result of "no observations"


    . regress PaidAmount Year Metro LINDEX MonthofYear HaveTitle Mileage Running MC
    no observations
    r(2000);



    I know that some recoding will need to be done, as Stata won't know which of my number-coded variables are categorical, etc. But I don't know how to move forward. Any assistance would be greatly appreciated. Thanks!

  • #2
    Sarah:
    are you sure that no string variables are included in your regression model?
    That said, posting an excerpt/example of your dataset via -dataex- can increase your chances of getting helpful replies.
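    A minimal sketch of both suggestions, run on the auto.dta dataset shipped with Stata rather than on Sarah's data (so the variable names below are not hers). -ds- lists the string variables and -dataex- produces an excerpt that can be pasted into a post. It assumes a recent Stata in which -dataex- is already installed; older installations may need -ssc install dataex- first.

    ```stata
    * Load the example dataset that ships with Stata
    sysuse auto, clear

    * List only the string variables; the names are also left in r(varlist)
    ds, has(type string)
    local strvars `r(varlist)'
    display "String variables: `strvars'"

    * Produce a small, copy-and-paste excerpt of selected variables
    dataex make price rep78 foreign in 1/5
    ```

    Any variable reported by -ds- here cannot go into -regress- as it stands.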
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Carlo,
      Thanks for your reply. There are string variables included...



      • #4
        Here is more information on my variables


        Contains data
          obs:        66,738
         vars:            19
         size:     9,476,796
        -----------------------------------------------------------------------------------------------------------
                      storage   display    value
        variable name   type    format     label      variable label
        -----------------------------------------------------------------------------------------------------------
        KADON_ID        long    %10.0g                KADON_ID
        VehicleType     str11   %11s                  VehicleType
        Year            int     %10.0g                Year
        Make            str17   %17s                  Make
        Model           str25   %25s                  Model
        PickupZip       long    %10.0g                PickupZip
        Metro           str3    %9s                   Metro
        PaidAmount      double  %10.0g                PaidAmount
        DonationMonth   str7    %9s                   DonationMonth
        SellerType      str5    %9s                   SellerType
        INDEX           double  %10.0g                INDEX
        LINDEX          double  %10.0g                LINDEX
        SaleMonth       str7    %9s                   SaleMonth
        MonthofYear     byte    %10.0g                Month of Year
        SaleDate        double  %td..                 SaleDate
        HaveTitle       double  %10.0g                HaveTitle
        Mileage         long    %10.0g                Mileage
        Running         double  %10.0g                Running



        • #5
          Metro is string and that alone is fatal. It should be numeric somehow. To suggest "somehow", we would need to know its distinct values.

          The results above don't include MC at all so far as I can see but you evidently didn't copy and paste all your results as only 18 variables are mentioned.

          You should be able to check what you post!



          • #6
            Just to add to Carlo & Nick's excellent advice, after converting her string variables to a numeric representation (probably using -encode-, though without seeing the values of the string variables it's not entirely certain), Sarah will want to learn about factor variable notation (-help fvvarlist-) to see how to specify her categorical variables in her regression command.
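            A minimal sketch of what -encode- does, on a six-row toy dataset typed in with -input- (the values and the new variable name MetroCode are made up for illustration; only the names Metro and PaidAmount come from the thread). -encode- assigns integer codes in alphabetical order of the strings and attaches the original strings as value labels.

            ```stata
            * Toy data: a string category and a numeric outcome (values are invented)
            clear
            input str3 Metro double PaidAmount
            "NYC" 500
            "BOS" 450
            "NYC" 520
            "LA"  610
            "BOS" 430
            "LA"  590
            end

            * Create a numeric, value-labelled copy of the string variable
            encode Metro, generate(MetroCode)

            * Show the mapping between integer codes and the original strings
            label list MetroCode

            * Same tabulation, with and without the labels
            tabulate MetroCode
            tabulate MetroCode, nolabel
            ```

            The encoded variable still displays as BOS, LA, NYC, but it is stored as numbers, so it can be passed to a model-fitting command with the i. prefix.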



            • #7
              Thank you Nick and Clyde.

              The values of Metro are the 20 largest metro areas in the US. What ideally I would like to say is "All things being equal, being in NYC metro area adds $5 to sale price", etc.
              I know that having a title adds $x to sale price, having an extra hundred miles on the odometer takes off $x, and I would like a number that being in a certain Metro adds or subtracts from sale price. Similarly with car models (MC). I have the top 26 car models (in my data set) and want to know "The car being a camry adds $x to the value" etc.


              . tabulate Metro

                    Metro |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                      ATL |        662        0.99        0.99
                      BOS |      7,869       11.79       12.78
                      CHA |        148        0.22       13.00
                      CHI |      4,899        7.34       20.35
                      DAL |      1,421        2.13       22.47
                      DCB |      6,147        9.21       31.69
                      DEN |        639        0.96       32.64
                      DET |        279        0.42       33.06
                      HOU |      1,369        2.05       35.11
                       LA |      3,878        5.81       40.92
                      MIA |        494        0.74       41.66
                      MIN |      4,036        6.05       47.71
                      NYC |     24,067       36.06       83.77
                      PHI |        939        1.41       85.18
                      PHO |        565        0.85       86.03
                      SEA |      1,678        2.51       88.54
                      SFC |      6,840       10.25       98.79
                      SND |        328        0.49       99.28
                      STL |        289        0.43       99.71
                      TAM |        191        0.29      100.00
              ------------+-----------------------------------
                    Total |     66,738      100.00


              For MC, 1 represents the top car, 2 represents the next car, etc. However, it really should be categorical because, as I said, Camry would be the most popular car but not necessarily the most valuable.


              . tabulate MC

                       MC |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        1 |      3,189       11.01       11.01
                        2 |      2,973       10.26       21.27
                        3 |      2,758        9.52       30.79
                        4 |      2,154        7.43       38.22
                        5 |      1,426        4.92       43.14
                        6 |      1,381        4.77       47.91
                        7 |      1,281        4.42       52.33
                        8 |      1,127        3.89       56.22
                        9 |      1,030        3.55       59.77
                       10 |        993        3.43       63.20
                       11 |        844        2.91       66.11
                       12 |        810        2.80       68.91
                       13 |        770        2.66       71.57
                       14 |        733        2.53       74.09
                       15 |        713        2.46       76.56
                       16 |        697        2.41       78.96
                       17 |        687        2.37       81.33
                       18 |        676        2.33       83.67
                       19 |        670        2.31       85.98
                       20 |        653        2.25       88.23
                       21 |        637        2.20       90.43
                       22 |        627        2.16       92.59
                       23 |        586        2.02       94.62
                       24 |        540        1.86       96.48
                       25 |        519        1.79       98.27
                       26 |        501        1.73      100.00
              ------------+-----------------------------------






              • #8
                Sarah:
                as far as your MC regression is concerned, can't you make things clearer by providing an example using -auto.dta-, which is included among the Stata datasets? Thanks.
                Kind regards,
                Carlo
                (Stata 19.0)



                • #9
                  I'm sorry Carlo, I don't understand what you are asking for (how I would use a sample data set to explain).

                  Basically I am looking to make a formula that would be something like

                  SALEPRICE= Year(a) + Metro(b) + LINDEX(c) + MonthofYear(d) + HaveTitle(e) + Mileage(f) + Running (g) + MC(h)

                  But I know that my variables are not formatted correctly to do that, and I'm not sure how to adjust them. Metro right now is stored as a string, as you can see above, and I could easily change it to numeric, but I want Stata to recognize it as categorical regardless. I am not sure if this is all possible.
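                  A rough sketch of how a formula of this shape maps onto Stata's factor-variable notation, using auto.dta because Sarah's data are not available. Here rep78 (coded 1 to 5) stands in for a number-coded categorical such as MC, and foreign for a two-level category. Each numeric predictor gets one coefficient; each i. variable gets one coefficient per level other than the base, read as the difference from the base level with everything else held equal. The commented-out command for Sarah's data is an untested guess, and MetroCode is a hypothetical name for an encoded copy of Metro.

                  ```stata
                  sysuse auto, clear

                  * i. marks a variable as categorical; baselevels shows the omitted level
                  regress price mpg weight i.rep78 i.foreign, baselevels

                  * ib3. picks level 3 as the reference category instead of the lowest one
                  regress price mpg weight ib3.rep78 i.foreign, baselevels

                  * Hypothetical analogue for the thread's data (not run here):
                  * regress PaidAmount Year i.MetroCode LINDEX i.MonthofYear HaveTitle Mileage Running i.MC
                  ```

                  With this setup a statement like "being in NYC adds $x" is always relative to whichever metro area is the base level, so the choice of base matters for how the output reads.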



                  • #10
                    Hello Sarah,

                    Once all variables are numeric, you may also want to think about listwise deletion: the model drops every observation with a missing value in any of its variables, so many missing values can shrink the estimation sample considerably.
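                    One way to gauge how much listwise deletion will cost, sketched on auto.dta (not Sarah's data): count the observations that are complete on every model variable and compare with the estimation sample. The variable name nmiss is made up for this example.

                    ```stata
                    sysuse auto, clear

                    * Report missing values per variable (only variables with missings are listed)
                    misstable summarize price mpg rep78 weight

                    * Count missing values per observation across the model variables
                    egen nmiss = rowmiss(price mpg rep78 weight)

                    * Observations with nothing missing are the only ones regress can use
                    count if nmiss == 0

                    * The estimation sample, e(N), should match that count
                    regress price mpg rep78 weight
                    display e(N)
                    ```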

                    That aside, and moving on to this sentence:

                    Similarly with car models (MC). I have the top 26 car models (in my data set) and want to know "The car being a camry adds $x to the value" etc.

                    maybe what you wish with the variable MC can be accomplished by using the "if" clause.

                    Best,

                    Marcos
                    Last edited by Marcos Almeida; 06 Jan 2017, 08:13.



                    • #11
                      On the information so far,

                      Metro should be encoded

                      Both MC and Metro should be specified to a suitable model-fitting command using factor variable notation.

                      The overall goal was stated in #1 to be predicting SalePrice, but PaidAmount is the variable mentioned in the command. I can't see any reason to prefer regress over poisson for a response variable of that kind. http://blog.stata.com/2011/08/22/use...tell-a-friend/
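                      A minimal sketch of that alternative on auto.dta, with price standing in for a non-negative amount such as PaidAmount. As far as I understand the linked post, the recommendation is poisson with robust standard errors; the particular predictors here are chosen only for illustration.

                      ```stata
                      sysuse auto, clear

                      * Poisson regression for a non-negative outcome, with robust standard errors
                      poisson price mpg weight i.rep78 i.foreign, vce(robust)

                      * Redisplay the results as exponentiated coefficients (multiplicative effects)
                      poisson, irr
                      ```

                      The coefficients are on the log scale, so an exponentiated coefficient of 1.05 for a category would mean a price about 5% higher than the base category, other things equal, rather than a fixed dollar amount.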



                      • #12
                        Sorry Nick, PaidAmount is the same thing as SalePrice. I made a mistake: in a previous project I had it named SalePrice and here it is named PaidAmount.



                        • #13
                          I've never used the encode function, but I'm going to do that now. Thanks



                          • #14
                            Sarah: Your doubts in #9 were already answered by Clyde in #6 and he advised you what to read up on.

                            encode is a command, not a function. In Stata the two are quite different.
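                            A small illustration of the distinction on invented toy data (MonthTxt and the generated variable names are hypothetical): a command such as encode is issued on its own line and acts on the dataset, whereas functions such as real() and strlen() return a value and only appear inside an expression.

                            ```stata
                            * Toy data with two string variables (values are invented)
                            clear
                            input str3 Metro str2 MonthTxt
                            "NYC" "01"
                            "BOS" "12"
                            "LA"  "07"
                            end

                            * Command: stands alone and creates a new labelled numeric variable
                            encode Metro, generate(MetroCode)

                            * Functions: used inside the expression of a generate command
                            generate MonthNum = real(MonthTxt)
                            generate MetroLen = strlen(Metro)

                            list
                            ```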



                            • #15
                              Thank you all.
                              I encoded Metro. Nick you are right, I will most likely end up using Poisson. I am going to do a few different regressions and use whichever one is the best fit.

                              Now when I do the linear regression, I actually get an equation (instead of an error message). However, I am not confident that Stata realizes that "MC" is categorical... any suggestions on how to move forward with that?
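                              One quick check, sketched on auto.dta with rep78 playing the role of MC: entered without a prefix, a number-coded variable is treated as continuous and gets a single slope; with the i. prefix it gets one indicator coefficient per non-base level, which shows up in the model degrees of freedom.

                              ```stata
                              sysuse auto, clear

                              * No prefix: rep78 is treated as continuous (one slope)
                              regress price mpg rep78
                              display "model df, continuous rep78:  " e(df_m)

                              * i. prefix: rep78 is treated as categorical (one term per non-base level)
                              regress price mpg i.rep78
                              display "model df, categorical rep78: " e(df_m)

                              * Joint test of all the rep78 indicators
                              testparm i.rep78
                              ```

                              If the output for MC shows a single row labelled MC, Stata is treating it as continuous; a block of rows labelled 2, 3, 4, ... under MC means it is being treated as categorical.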

                              Obviously my Stata skills are a bit rusty, so I really appreciate all of your patience and help. I know that none of you can relate to this (based on your names), but I am also 9 months pregnant, so please be understanding if I sound a little in a fog!
