Variable selection process in Panel data analysis

Dejan Tesic

Join Date: Aug 2018

Posts: 13
#1

14 Aug 2018, 03:10

Hi everyone. As you can see, this is my first post. After reading a lot of posts in this forum, I still have one question regarding the variable selection process in panel data analysis. There haven't been many econometric studies in my field of research, so I had to start from somewhere and spent a lot of time and effort collecting data on 30 potential predictors for my panel data research. One of the research questions in my dissertation is to examine a wide range of variables. My question is: how do I determine which predictors are statistically significant and should be kept in my model?

a) Can I do it based on their p-values? That is, start with all 30 predictors, eliminate the predictor with the highest p-value, and repeat the process until all remaining predictors are significant (backward elimination)? Which model should I start from in order to get the p-values: pooled, FE, or RE?

b) Many experienced Stata users in this forum have suggested that selection based on p-values is not the best solution and that selection should instead be done with non-inferential statistics (like correlation coefficients). Can someone please explain that process and its criteria?
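
The backward-elimination loop described in (a) can be sketched as follows. This is a minimal illustration in Python rather than Stata, under stated assumptions: `backward_eliminate`, `fit_pvalues`, `fake_fit`, the predictor names, and the 0.05 threshold are all hypothetical, and the p-values are made up. A real `fit_pvalues` would refit the chosen panel model at each step.

```python
def backward_eliminate(predictors, fit_pvalues, alpha=0.05):
    """Drop the predictor with the highest p-value until all are below alpha.

    fit_pvalues is a caller-supplied (hypothetical) function that refits the
    chosen model on the given predictors and returns {name: p_value}.
    """
    kept = list(predictors)
    while kept:
        pvalues = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] < alpha:
            break  # every remaining predictor is significant at alpha
        kept.remove(worst)
    return kept


# Stand-in for a real model fit: fixed, made-up p-values for illustration only.
# In a real fit the p-values would change each time a predictor is dropped.
FAKE_PVALUES = {"x1": 0.001, "x2": 0.40, "x3": 0.03, "x4": 0.75}


def fake_fit(names):
    return {name: FAKE_PVALUES[name] for name in names}


selected = backward_eliminate(["x1", "x2", "x3", "x4"], fake_fit)
print(selected)
```

The sketch only shows the mechanics of the loop; it says nothing about whether p-value-driven selection is advisable.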

Any advice is highly appreciated.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

15 Aug 2018, 12:26

There is a literature on variable selection, but most of us work within paradigms where the variables to be included are determined by our theories (and knowledge) regarding the phenomena. While different areas take different views, many of us were trained in traditions where searching for variables by estimating models is seen as not terribly productive. The arguments often emphasize the possibility of overfitting or of finding idiosyncratic features of the specific sample that do not generalize to the population. This is almost diametrically opposed to the "data mining" that has become popular - when I trained, data mining had a strongly negative connotation.

That said, there are many ways folks have tried to deal with research like yours. I cannot pretend to know more than a few of them. One is to try to reduce the number of variables to a small set of factors using exploratory factor analysis. This is based on the belief that many of your variables may reflect the same underlying factor. Another is to use stepwise (see the Stata documentation) to do the analysis. Stata's stepwise works on p-values, but there are also stepwise approaches that emphasize explained variance or adjusted R-squared. I don't see bivariate correlations as particularly helpful - the parameters and effects in a regression with several IVs often differ in sign and magnitude from the bivariate correlation. If you have sufficient data, I would strongly encourage you to use a hold-out sample, so that you can check the selected model on data that were not used to choose it.
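
A minimal sketch of the hold-out idea for panel data, in Python rather than Stata and on made-up data. It assumes that whole panel units, not individual observations, are set aside, so that no unit appears in both samples. That choice, the names `split_panels_holdout` and `unit_id`, and the 25% hold-out share are illustrative assumptions rather than anything prescribed in this thread.

```python
import random


def split_panels_holdout(rows, unit_key="unit_id", holdout_share=0.25, seed=12345):
    """Split panel rows into estimation and hold-out samples by whole unit."""
    # Collect the distinct panel units and shuffle them reproducibly.
    units = sorted({row[unit_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(units)

    # Reserve a share of units (at least one) for the hold-out sample.
    n_holdout = max(1, round(len(units) * holdout_share))
    holdout_units = set(units[:n_holdout])

    estimation = [r for r in rows if r[unit_key] not in holdout_units]
    holdout = [r for r in rows if r[unit_key] in holdout_units]
    return estimation, holdout


# Hypothetical balanced panel: 8 units observed over 5 years each.
panel = [
    {"unit_id": u, "year": 2010 + t, "y": float(u + t)}
    for u in range(1, 9)
    for t in range(5)
]

estimation, holdout = split_panels_holdout(panel)
print(len(estimation), len(holdout))
```

Variable selection would then be run on `estimation` only, with `holdout` kept aside to see how well the chosen model predicts observations it has not seen.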
Dejan Tesic

Join Date: Aug 2018

Posts: 13
#3

17 Aug 2018, 07:37

Phil, thank you for the advice. I researched your suggestions.

As far as I can see, stepwise doesn't work with panel data. Is there a way to circumvent that? Can I get p-values from an FE model and start that way? Also, I didn't find anything in the Stata documentation saying that exploratory factor analysis can be done for panel data. Maybe I'm missing something, as I am pretty inexperienced with Stata.

As for a hold-out sample, it's a great suggestion. I will try that as soon as I find out how to do stepwise or exploratory factor analysis for panel data.

Shilpa Shetty

Join Date: Apr 2020

Posts: 6
#4

20 Apr 2020, 03:45

Hi Dejan. I am new to Stata and this is my first post. I have a question similar to yours. I am using xtlogit (logistic regression for panel data) for my research, and I am having difficulty finding a way to reduce the number of variables. Stata does not support stepwise for xtlogit, and I am not sure how factor analysis can be performed for panel data. I have a balanced panel.

What solution did you find for this issue? Kindly advise.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

20 Apr 2020, 06:31

You didn't say how many variables are in the data. That said, I'm not fond of p-value-selected variables, so to speak. In health sciences (which may not be your case), I tend to follow this approach: first, there are variables we shall never get rid of, such as age, sex, etc. Then, there are variables related to the study question itself; say, if I'm dealing with echocardiographic topics, some echocardiographic parameters will surely be included. Then, I check the literature. Out of exploratory curiosity, maybe some variables could be inspected as well... Finally, depending on the number of observations and (always) wishing to rely on the parsimony principle, I may test whether a few additional variables would improve the model. That is it. By the way, I tend to avoid models with too many variables, more so if it is possible to "explain" the outcome with fewer ones.

Best regards,

Marcos