Find the right model

Jean Kellerberg

Join Date: Jan 2018

Posts: 3
#1

Find the right model

11 Jan 2018, 10:20

Hello guys,

A few weeks ago I started my first econometrics project at my university, where I want to measure "The (medium term) Effect of Smoking Cessation on Bodyweight/BMI" using a panel data set.
My biggest problem right now is that I am not sure how to identify the effect I am interested in, although I have spent a lot of time researching and looking for a fitting model on the internet.
The data set runs from 2002 to 2012, and people answer the questions about smoking and bodyweight every two years. Of course, only a small subsample of the participants takes part in all years.

First I tried the model that my prof suggested to me:

Code:

* weight in the current wave
reg weight smoking $vars if l2.smoking == 1, r cluster(id)
* weight at t+2, written with the lead operator (requires xtset)
reg f2.weight smoking $vars if l2.smoking == 1, r cluster(id)
* and so on...

But the estimated value seemed too small and wrong: at t+4 the value decreased, which shouldn't happen, I guess. Moreover, I am not sure how the model deals with people who started smoking again in the years between l2.smoking and t+4.

Then I tried two models with a) a dummy variable for the last year in which the person smoked or b) a dummy variable for the first year of the cessation. With these dummies I no longer had the problem from the first model, where I was not sure whether people had started smoking again.

Code:

reg weight dummy $vars, r cluster(id)
* weight at t+2, written with the lead operator (requires xtset)
reg f2.weight dummy $vars, r cluster(id)
* but also
reg dummy $vars, r cluster(id)

I also tried including i.year and even tried fixed effects, but once again the values were small and sometimes far from significant.

My last idea was the difference-in-differences method, with always-smokers as the control group and the people who stopped smoking in 2006 or 2004 as the treatment group. I am aware that the "treatment" is not exogenous, but the results looked good.
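A minimal sketch of how such a difference-in-differences setup could look in Stata, for the 2006 quitters only. The variable `quit06` is hypothetical (1 = smoked through 2004 and reports not smoking from 2006 on, 0 = smoked in every wave), and `post` is created here for illustration; `weight`, `year`, `id` and `$vars` are taken from the commands above.

```stata
* Assumption: quit06 is a hypothetical 0/1 indicator that already exists
*   1 = smoked up to 2004, non-smoker from 2006 on
*   0 = smoked in every wave (control group)
* post marks the waves from 2006 onward
generate byte post = (year >= 2006) if !missing(year)

* The coefficient on 1.quit06#1.post is the difference-in-differences estimate
regress weight i.quit06##i.post $vars, vce(cluster id)
```

People who quit in 2004 would need their own `post` definition (waves from 2004 onward), so pooling both cohorts in this simple form is not straightforward.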

Since I often found myself on this website, especially because I had never touched Stata before, I would like to ask the users here if they have an idea for me. I already asked a similar question on another statistics forum, but they suggested a very different non-regression approach, which I know nothing about. Moreover, the models do not have to be "perfect" because this is my first project in this field.

Thanks !

Last edited by Jean Kellerberg; 11 Jan 2018, 10:29. Reason: spelling
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17703
#2

11 Jan 2018, 10:50

Jean:
first of all I'm wondering why you remain on -regress- when you can use -xtreg- to analyze a panel dataset.
Some comments about your post:
- the reduction of -weight- in the last wave of data might be due to a monotonic pattern of missing data (see the -mi- glossary in Stata .pdf manual). In the light of that, you should probably devote some hours to dealing with missing data (again, -mi- entries, including their reference sections, in Stata .pdf manual are a good place to start);
- whenever it comes to analyzing something, it's very likely that somebody else was presented in the past with the same research topic and has published something. Skimming through the literature in your research field can give you some clues/hints about a regression model that has a good chance of being accepted by reviewers/teachers/supervisors (by the way: what's the opinion of your professor of econometrics about the whole matter?);
- preferring "the most" statistically significant model over the one that gives a fair and true view of the data generating process is not an approach that I would advise.
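As a rough sketch of the first steps suggested above, assuming the variable names used earlier in the thread (`id`, `year`, `weight`, `smoking`, `$vars`) and that the waves are declared as two years apart:

```stata
* Declare the panel; delta(2) tells Stata the waves are two years apart
xtset id year, delta(2)

* Participation pattern across waves (who is observed when)
xtdescribe

* Missing-value patterns for the key variables
misstable patterns weight smoking

* Fixed-effects panel regression with cluster-robust standard errors
xtreg weight i.smoking i.year $vars, fe vce(cluster id)
```

Note that with `delta(2)` the operator `l2.` refers to two waves (four years) back, whereas with the default `delta(1)` it refers to two calendar years back.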

Kind regards,
Carlo
(Stata 19.0)
Jean Kellerberg

Join Date: Jan 2018

Posts: 3
#3

13 Jan 2018, 12:15

Hey Carlo and other people,

I already read about 20-30 papers but luckily I found a very interesting and "easy" model this weekend.

So I am using the balanced panel data from 2002 to 2012, with two-year gaps, and two groups of people: people who smoked in all years and people who smoked from 2002 to 2006 and then stopped.
Let's just assume that the data are right and that people don't smoke between the survey years; I am not sure how the smoking question was worded.

I do 3 regressions
1) only with years 2006 and 2012
2) only with years 2006 and 2010
3) only with years 2006 and 2008
All other years are dropped in the three regressions.
The paper used only one regression, while I am, as said in my first post, interested in the t+2 and t+4 effects.
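The three two-wave regressions listed above could be run in one loop, as a sketch. It restricts the sample with `if` instead of dropping observations, assumes the data are already `xtset`, and uses the stored-estimate names `fe_2008`, `fe_2010` and `fe_2012`, which are made up for illustration.

```stata
* One fixed-effects regression per comparison year, each using 2006 as the base wave
foreach y in 2008 2010 2012 {
    xtreg weight smoking $variables_like_in_the_paper if inlist(year, 2006, `y'), fe vce(cluster id)
    estimates store fe_`y'
}

* Side-by-side coefficients and standard errors for smoking
estimates table fe_2008 fe_2010 fe_2012, keep(smoking) b se
```

Because the full data set stays in memory, the same file can be reused for each horizon without reloading.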

Now I run simple fixed effects (first differences gives the same results), random effects, and then the Hausman test.
The coefficient on smoking should show me the effect of cessation, since we compare people who smoked and became non-smokers with always-smokers.

Code:

xtreg weight smoking $variables_like_in_the_paper, fe
estimates store fixed1
xtreg weight smoking $variables_like_in_the_paper, re
estimates store random1
hausman fixed1 random1

All coefficients for smoking are highly significant and in the range I expected them to be.
The only thing that bothers me a little is the small overall R^2 = 0.0013.

What do you think about it? I guess this could be the right model I was looking for.

The paper is called
Smoking Cessation and Changes in Body Mass Index Among Middle Aged and Older Adults, 2016 by Andy Sharma
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17703
#4

14 Jan 2018, 07:20

Jean:
methodologically speaking, I would not support deleting years, in that you may end up with a (biased) sample which is only tenuously representative of the starting one.
That said, I would consider dealing with missing data (especially if the missingness is informative) instead of analyzing the complete cases only.
As far as the assumed low overall R-sq is concerned, please note that -fe- specification is expected to maximize the within R-sq, whereas -re- specification is expected to maximize the between R-sq.
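The three R-sq values can be inspected directly after estimation; a small sketch, reusing the command from the previous post:

```stata
* Fixed-effects fit; xtreg stores within, between and overall R-sq in e()
xtreg weight smoking $variables_like_in_the_paper, fe

display "within  R-sq = " e(r2_w)
display "between R-sq = " e(r2_b)
display "overall R-sq = " e(r2_o)
```

For the -fe- model, the within value is the one to look at when judging fit.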

Kind regards,
Carlo
(Stata 19.0)