
  • Variable selection for ordered dependent variable in panel data

    Hello,

    I am currently thinking about my variable selection process that I will need to do later down the road in my research.

    I am working with a robust panel data set of 19 000 individuals observed in 9 waves. Around 500 variables are available for each individual in each time period.

    My main focus will be an ordered variable (5 levels). What are my options for the variable selection process here?

    I wanted to use lasso/elastic net, but that is not possible because they are for linear models but my focus will be an ordered variable.

    My colleague suggested first estimating a panel ordered probit, then using the predict command to obtain the fitted values, and then running the lasso/elastic net with the fitted values as the dependent variable.
    But I had trouble finding this approach elsewhere.

    Are there any other methods that could be executed in Stata in my case? Or should I resort to my econometric sense and just select a sensible pool of variables using trial and error?

    Any help will be greatly appreciated.

  • #2
    Josef:
    welcome to this forum.
    I'd take a look at the literature in your research field to select the predictors to be included in your regression.
    Stepwise procedures, if you're considering them as possible tools, come with some issues (https://www.stata.com/support/faqs/s...ssion-problems).
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      I wanted to use lasso/elastic net, but that is not possible because they are for linear models but my focus will be an ordered variable.
      You can use lasso to search for an optimal ordered logistic regression model. See pologit and xpologit in the user manual. https://www.stata.com/manuals/lasso.pdf

      I think the tricky part is accounting for the panel design, since, from what I can tell looking over the documentation, there isn't an xt-style lasso for panel data. I think you can overcome this by organizing your data in long format and specifying the vce(cluster clustvar) option, replacing clustvar with whichever variable identifies individuals. This should give you a model with waves nested within individuals.
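
      A minimal sketch of the long-format, clustered approach described above. All variable names (id, wave, y, d1, x1-x500) are hypothetical placeholders, and the data are assumed to be in long format already, with one row per individual and wave. As far as I can tell from the manual, pologit models a 0/1 outcome, so this sketch collapses the 5-level variable into a binary indicator purely for illustration; the cutoff at 4 is an arbitrary assumption.

      ```stata
      * Hypothetical names: id = individual, wave = 1..9, y = 5-level outcome,
      * d1 = covariate of interest, x1-x500 = candidate controls.
      * Assumes long format: one row per id-wave combination.

      * pologit expects a binary outcome, so dichotomize y (arbitrary cutoff)
      generate byte y_high = (y >= 4) if !missing(y)

      * Lasso selects controls from x1-x500; standard errors clustered on id
      pologit y_high d1, controls(x1-x500) vce(cluster id)

      * Summarize the lassos that were fit and how many controls each kept
      lassoinfo
      ```

      Note that clustering on id adjusts the standard errors for repeated observations on the same individual; it does not add individual-level random or fixed effects to the model.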

      Or should I resort to my econometric sense and just select a sensible pool of variables using trial and error?
      You say this like it's problematic to resort to making deductions about the data based on your theoretical understanding of the underlying phenomena, but even assuming you have measures of every component of the "true" data generation process in your dataset, lasso still "does not select the covariates of the true model with probability 1" (page 13 of the manual).
