
  • Predict, xb command -- Dummy Variable for Year

    Hi!

    I am using Stata/MP from a Mac.

    I am trying to use database A to predict the cost of healthcare in database B. Database B has the same demographic / health status characteristics as database A, but it doesn't have the "cost" variable.
    Also:
    In database A, the years available are 1996-2011.
    In database B, the years available are 2008-2011.

    In database A, I use the following regression:
    quietly tab year, gen (d_year)
    areg cost age sex race disease [aweight= perwt], absorb (duid) robust cluster (duid)

    Then in database B, I use the command:
    predict cost, xb

    I get an error message saying that d_year1, d_year2, etc. are not found. This makes sense since I do not have the same years in database A and B.
    Do you know any way around this?

    Thanks in advance!



  • #2
    I'm confused by your post. It appears you start with database A, and generate some indicator variables d_year* for years there. Then you run a regression, though based on what you posted, it does not use these d_year* variables. Then you switch to database B in the hopes of doing out-of-sample predictions?

    If your regression actually involved the d_year* variables, I could see why Stata would complain that it can't find d_year* because you would need to generate them anew in database B: once you load in database B, nothing you did in database A persists. So that would appear to be the solution to your problem. But since your regression actually doesn't include those d_year* variables, I don't see why Stata would complain about not being able to find d_year* when asked to run -predict-, as they would play no role in that.

    So I suspect your situation is not exactly the way you have described it. To clarify it, please provide a sample of your data in both databases, the results of -describe- for datasets A and B, the exact code you ran (copied and pasted from Stata, not re-typed or paraphrased), and the exact output that Stata gave you from it (again, copied and pasted, not re-typed or paraphrased).



    • #3
      Aren't you going to have to create values of "d_year1 d_year2" etc. in database B? What if you simply generate d_year1 = 0 (and so on for all of the dummy variables corresponding to each year between 1996 and 2007)? Any regression coefficient multiplied by zero equals zero.



      • #4
        Hi Clyde,

        Thanks for your answer!

        Sorry, I did re-type the code and forgot the "d_year*" in the process. My bad!
        Yes, I'm trying to do out-of-sample predictions.

        So here is the actual code:

        From Database A:
        quietly tab year, gen (d_year)
        areg total_cost age male black ind_alsk asian other_race ac_haw wagep wlklim coglim hearng actlim seedif d_year* [aweight= perwt], absorb (duid) robust cluster (duid)

        From Database B:
        predict total_cost, xb

        The error message:
        variable d_year1 not found
        variable d_year2 not found
        variable d_year3 not found
        variable d_year4 not found
        variable d_year5 not found
        variable d_year6 not found
        variable d_year7 not found
        variable d_year8 not found
        variable d_year9 not found
        variable d_year10 not found
        variable d_year11 not found
        variable d_year12 not found
        variable d_year13 not found
        variable d_year14 not found
        variable d_year15 not found
        variable d_year16 not found
        I'm actually running a do file right now, so I can't send the -describe- commands, but I will do so ASAP!

        Thanks!



        • #5
          Thanks a lot for your advice, Stephen. I'm pretty sure this will work!



          • #6
            Just to be sure, should I do this?

            In database B:
            quietly tab year, gen (d_year)
            => I will get d_year1 - d_year4 corresponding to the years 2008-2011

            Then generate d_year5 - d_year16 for 1996-2007?

            Or should I follow the order from database A and do:
            gen d_year1 = 0
            gen d_year2 = 0
            ...
            gen d_year12 = 0
            gen d_year13 = (year == 2008)
            gen d_year14 = (year == 2009)
            etc ?

            Sorry, this might be a silly question!



            • #7
              Your question is not clear; however, the answer is clear: you want your data in "B" to be defined in exactly the same way as your data in "A". That is, each d_year variable must stand for the same year in both datasets.
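
              A minimal sketch of that idea, assuming database A contains every year from 1996 to 2011, so that -tab year, gen(d_year)- numbered the indicators d_year1 = 1996 through d_year16 = 2011. The first three lines only build a toy stand-in for database B and are not part of the original data; with the real database B in memory, only the loop would be needed.

              ```stata
              * Toy stand-in for database B (hypothetical): one row per year, 2008-2011
              clear
              set obs 4
              generate year = 2007 + _n

              * Rebuild the indicators using database A's numbering:
              * d_year1 is 1996, ..., d_year13 is 2008, ..., d_year16 is 2011
              forvalues y = 1996/2011 {
                  local k = `y' - 1995
                  generate byte d_year`k' = (year == `y')
              }

              * d_year1-d_year12 are all zero here; d_year13-d_year16 flag 2008-2011
              list year d_year13-d_year16
              ```

              Running -tab year, gen(d_year)- directly in database B would instead label 2008 as d_year1, so the coefficient estimated for 1996 would be applied to 2008 observations.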



              • #8
                Thanks for your help!



                • #9
                  Ana could use factor variable notation without having to generate the indicator variables.

                  The command
                  Code:
                  areg total_cost age male black ind_alsk asian other_race ac_haw wagep wlklim coglim hearng actlim seedif d_year* [aweight= perwt], absorb (duid) robust cluster (duid)
                  with factor variable notation would be
                  Code:
                  areg total_cost age male black ind_alsk asian other_race ac_haw wagep wlklim coglim hearng actlim seedif i.year [aweight= perwt], absorb(duid) vce(cluster duid)
                  I changed d_year* to i.year. This factor variable notation was introduced in Stata 11.

                  I also changed robust cluster(duid) to the modern vce(cluster duid). The vce() option replaced the robust and cluster() options in Stata 10.
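
                  To illustrate why this helps with the out-of-sample step, here is a small simulated sketch. Everything in it is assumed for illustration: the data are random, only age and year are used as regressors, weights are omitted, and "database B" is mimicked by keeping 2008-2011 and dropping the cost variable. The point is that -predict- only needs the variable year itself, not a set of d_year* indicators.

                  ```stata
                  * Simulated stand-in for database A (hypothetical data): 100 units, 1996-2011
                  clear
                  set seed 12345
                  set obs 1600
                  generate duid = ceil(_n/16)
                  bysort duid: generate year = 1995 + _n
                  generate age = 20 + floor(60*runiform())
                  generate total_cost = 100 + 5*age + 10*(year - 1996) + rnormal(0, 20)

                  * Fit with factor variable notation instead of d_year*
                  areg total_cost age i.year, absorb(duid) vce(cluster duid)

                  * Mimic database B: only 2008-2011 and no cost variable
                  keep if year >= 2008
                  drop total_cost

                  * Estimation results are still in memory, so predict works directly
                  predict total_cost_hat, xb
                  summarize total_cost_hat
                  ```

                  With the real data, the same pattern would be to run -areg- in database A, load database B with -use-, and then run -predict-. As far as I understand, xb after -areg- leaves out the absorbed duid effect, so the predictions rest on the covariates and year terms only.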

