Predict, xb command -- Dummy Variable for Year

Ana Carla

Join Date: Aug 2014

Posts: 39
#1


19 Aug 2014, 09:08

Hi!

I am using Stata/MP on a Mac.

I am trying to use database A to predict the cost of healthcare in database B. Database B has the same demographic / health status characteristics as database A, but it doesn't have the "cost" variable.
Also:
In database A, the years available are 1996-2011.
In database B, the years available are 2008-2011.

In database A, I use the following regression:
quietly tab year, gen (d_year)
areg cost age sex race disease [aweight= perwt], absorb (duid) robust cluster (duid)

Then in database B, I use the command:
predict cost, xb

I get the error message "d_year1 d_year2 etc." not found. This makes sense since I do not have the same years in database A and B.
Do you know any way around this?

Thanks in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

19 Aug 2014, 09:26

I'm confused by your post. It appears you start with database A, and generate some indicator variables d_year* for years there. Then you run a regression, though based on what you posted, it does not use these d_year* variables. Then you switch to database B in the hopes of doing out-of-sample predictions?

If your regression actually involved the d_year* variables, I could see why Stata would complain that it can't find d_year* because you would need to generate them anew in database B: once you load in database B, nothing you did in database A persists. So that would appear to be the solution to your problem. But since your regression actually doesn't include those d_year* variables, I don't see why Stata would complain about not being able to find d_year* when asked to run -predict-, as they would play no role in that.

So I suspect your situation is not exactly the way you have described it. To clarify it, please provide a sample of your data in both databases, the results of -describe- for datasets A and B, the exact code you ran (copied and pasted from Stata, not re-typed or paraphrased), and the exact output that Stata gave you from it (again, copied and pasted, not re-typed or paraphrased).
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#3

19 Aug 2014, 09:31

Aren't you going to have to create values of "d_year1 d_year2" etc. in database B? What if you simply generate d_year1 = 0 (and so on for all of the dummy variables corresponding to each year between 1996 and 2007)? Any regression coefficient multiplied by zero equals zero.
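
A minimal sketch of this suggestion, to be run after loading database B. It assumes that d_year1 through d_year12 in database A correspond to 1996 through 2007 (that is, that -tabulate- numbered the indicators in ascending year order and every year was present in A). The indicators for 2008-2011 would still have to be created separately, with the same numbering they had in database A.

```stata
* Sketch only: run in database B.
* Assumption: d_year1-d_year12 are the indicators for 1996-2007 in database A.
forvalues i = 1/12 {
    generate byte d_year`i' = 0    // none of these years occur in database B
}
```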
Comment
Ana Carla

Join Date: Aug 2014

Posts: 39
#4

19 Aug 2014, 09:36

Hi Clyde,

Thanks for your answer!

Sorry, I did re-type the code and forgot the "d_year*" in the process. My bad!
Yes, I'm trying to do out-of-sample predictions.

So here is the actual code:

From database A:
quietly tab year, gen (d_year)
areg total_cost age male black ind_alsk asian other_race ac_haw wagep wlklim coglim hearng actlim seedif d_year* [aweight= perwt], absorb (duid) robust cluster (duid)

From database B:
predict total_cost, xb

The error message:
variable d_year1 not found
variable d_year2 not found
variable d_year3 not found
variable d_year4 not found
variable d_year5 not found
variable d_year6 not found
variable d_year7 not found
variable d_year8 not found
variable d_year9 not found
variable d_year10 not found
variable d_year11 not found
variable d_year12 not found
variable d_year13 not found
variable d_year14 not found
variable d_year15 not found
variable d_year16 not found
I'm actually running a do file right now, so I can't send the -describe- commands, but I will do so ASAP!

Thanks!
Comment
Ana Carla

Join Date: Aug 2014

Posts: 39
#5

19 Aug 2014, 09:38

Thanks a lot for your advice, Stephen. I'm pretty sure this will work!
Comment
Ana Carla

Join Date: Aug 2014

Posts: 39
#6

19 Aug 2014, 09:53

Just to be sure, should I do this?

In database B:
quietly tab year, gen (d_year)
=> I will get d_year1 - d_year4 corresponding to the years 2008-2011

Then generate d_year5 - d_year16 for 1996-2007?

Or should I follow the order from database A and do:
gen d_year1 = 0
gen d_year2 = 0
...
gen d_year12 = 0
gen d_year13 = (year == 2008)
gen d_year14 = (year == 2009)
etc ?

Sorry, this might be a silly question!
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#7

19 Aug 2014, 09:56

Your question is not clear; however, the answer is clear: you want your data in "B" to be defined in exactly the same way as your data in "A".
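
One way to do that in database B is sketched below. It assumes that database A contains every year from 1996 to 2011, so that -tabulate, generate()- numbered the indicators in ascending order (d_year1 = 1996, ..., d_year16 = 2011); that mapping is worth checking in database A before relying on it.

```stata
* Sketch only: run in database B.
* Assumption: in database A, d_year`i' is the indicator for year 1995 + `i'.
forvalues i = 1/16 {
    * 1 if the observation is from that year, 0 otherwise;
    * left missing when year is missing, as -tabulate, generate()- would do
    generate byte d_year`i' = (year == 1995 + `i') if !missing(year)
}
```

With this numbering, d_year1 through d_year12 are all zero in database B, and d_year13 through d_year16 pick out 2008 through 2011.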
Comment
Ana Carla

Join Date: Aug 2014

Posts: 39
#8

19 Aug 2014, 10:19

Thanks for your help!
Comment
Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 700
#9

19 Aug 2014, 14:53

Ana could use factor variable notation without having to generate the indicator variables.

The command

Code:

areg total_cost age male black ind_alsk asian other_race ac_haw wagep wlklim coglim hearng actlim seedif d_year* [aweight= perwt], absorb (duid) robust cluster (duid)

with factor variable notation would be

Code:

areg total_cost age male black ind_alsk asian other_race ac_haw wagep wlklim coglim hearng actlim seedif i.year [aweight= perwt], absorb(duid) vce(cluster duid)

I changed d_year* to i.year. This factor variable notation was introduced in Stata 11.

I also changed robust cluster(duid) to the modern vce(cluster duid). The vce() option replaced options robust and cluster() in Stata 10.
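
Put together, the out-of-sample workflow might look like the sketch below. The file names databaseA and databaseB and the new variable name total_cost_hat are hypothetical placeholders, not names from this thread. The sketch relies on the estimation results staying in memory when a different dataset is loaded, and on database B containing a variable named year plus every other regressor under the same name and coding.

```stata
* Sketch only: databaseA, databaseB and total_cost_hat are hypothetical names.
use databaseA, clear

* Fit the model; i.year creates the year indicators on the fly
areg total_cost age male black ind_alsk asian other_race ac_haw wagep ///
    wlklim coglim hearng actlim seedif i.year [aweight = perwt],      ///
    absorb(duid) vce(cluster duid)

* Load the second dataset; the stored estimates remain available
use databaseB, clear

* Linear prediction from the stored coefficients, computed on database B
predict total_cost_hat, xb
```

Note that the xb prediction uses only the listed regressors and the constant; it does not include the absorbed duid effects.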
Comment
