How to control for multiple categorical variables (with multiple levels) without overadjusting

Federico Tedeschi

Join Date: Mar 2015

Posts: 139
#1

08 Oct 2015, 14:06

Dear Statalisters,

I'm using Stata version 11.2. I have an issue with data dimensionality in a regression, especially with respect to categorical variables. Let me explain in detail below.

I need to perform a hierarchical regression, but I have a problem with the variables in the first block. There are too many parameters to estimate (22) for my small dataset (N=423), given that I also have many predictors in the second block (14, some of them categorical again) and many missing values. The categorical variables are the main cause.
My feeling is that, if I include all variables in the first block, I'm increasing the standard error too much, thus hiding relevant predictors in the 2nd block; but, if I perform a stepwise regression (for the first block again), I'm not controlling for confounders enough (dropping most of them as non-significant). Finally, if I looked for the best subset of 1st block predictors, results would depend on the choice of reference categories.
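
On the reference-category concern: a joint test of all the dummies of one categorical variable does not depend on which level is the base, because the fitted model is the same. A minimal sketch, assuming Stata 11 factor-variable notation and using the nlsw88 data shipped with Stata as a stand-in (wage as outcome, race/occupation/industry as first-block variables, tenure and ttl_exp as second-block variables; these roles are illustrative assumptions, not the actual study variables):

```stata
* Stand-in data shipped with Stata; variable roles are hypothetical.
sysuse nlsw88, clear

* Full model with the default base category for each factor.
regress wage i.race i.occupation i.industry tenure ttl_exp

* Joint Wald test of all occupation dummies at once.
testparm i.occupation

* Same model with a different base category for occupation (level 3).
regress wage i.race ib3.occupation i.industry tenure ttl_exp

* Individual coefficients change, but the joint test should not.
testparm i.occupation
```

Deciding whether to keep or drop a categorical variable as a whole, rather than dummy by dummy, is one way to avoid results that hinge on the choice of reference level.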

I've found a paper and some online discussions about how to reduce the dimensionality of categorical variables for a regression, but they relate to other software packages, not Stata. I've tried the "mca" command and found that it creates a number of factors equal to the number of categorical variables (although, if I understand the Stata manual and help correctly, the number of dimensions of multiple correspondence analysis should equal the number of parameters). This reduces dimensionality, but I have no idea how much information I'm losing this way, given that the factors are built without taking the outcome into account.

I wondered whether a two-stage regression could help (as is usually done for instrumental variables, although with a different goal here): first predicting the outcome using the first block, then using the fitted values in the final regression together with the second-block variables. However, I haven't found this approach anywhere (at least for continuous outcomes).
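
The two-stage idea described above can be sketched as follows. This is only an illustration under stated assumptions: the nlsw88 data shipped with Stata stand in for the real dataset, wage plays the outcome, race/occupation/industry play the first block, and tenure and ttl_exp play the second block (all hypothetical roles). The full model is fitted alongside for comparison.

```stata
* Stand-in data shipped with Stata; variable roles are hypothetical.
sysuse nlsw88, clear

* Keep one common estimation sample so the two models are comparable.
egen nmiss = rowmiss(wage race occupation industry tenure ttl_exp)
keep if nmiss == 0

* Reference model: adjust for the first block directly.
regress wage i.race i.occupation i.industry tenure ttl_exp
estimates store full

* Stage 1: summarise the first block by its fitted values.
regress wage i.race i.occupation i.industry
predict double yhat_block1, xb

* Stage 2: use that single score in place of the first-block dummies.
regress wage yhat_block1 tenure ttl_exp
estimates store twostage

* Compare second-block coefficients and standard errors side by side.
estimates table full twostage, keep(tenure ttl_exp) se
```

Two caveats apply to this sketch. The second-stage standard errors treat yhat_block1 as a fixed regressor, so they do not reflect the first-stage estimation. And the second-block coefficients will generally differ from those of the full model unless the second block is uncorrelated with the first, so the two-stage fit is not the same thing as partialling out the first block.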

I feel that there should be a solution, given that I'm not interested in explaining anything or exploring any relationship here, but just in partialling out the effect of a bunch of covariates without increasing the standard errors of the other parameters.

Thanks for your attention.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#2

08 Oct 2015, 22:43

Federico:
you seem to have far too many predictors for your sample.
Maybe something like -pca- can help you in data reduction.

Kind regards,
Carlo
(Stata 19.0)
Federico Tedeschi

Join Date: Mar 2015

Posts: 139
#3

09 Oct 2015, 02:53

Thank you, Carlo.

That's indeed what I've done: I used the -mca- command because I have 9 categorical variables (I'd like to use a method that can account for both categorical and continuous variables, but I haven't found one in Stata). This led to the creation of 9 continuous variables, but I'm worried about the loss of information, because the factors are built totally disregarding the outcome variable. For example, in my case the dimensions most related to the outcome were the 1st, the 4th and the 9th, so nothing tells me that the information I'm discarding has little to do with the outcome.
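
One rough way to gauge the information lost is to compare how much of the outcome the retained MCA scores explain with how much the original dummies explain. A minimal sketch, again using the shipped nlsw88 data as a stand-in with hypothetical variable roles; the predict options are written as recalled from the -mca postestimation- help and should be checked against the installed version:

```stata
* Stand-in data shipped with Stata; variable roles are hypothetical.
sysuse nlsw88, clear

* Use the same observations for every step.
egen nmiss = rowmiss(wage race occupation industry)
keep if nmiss == 0

* MCA on three categorical variables, keeping three dimensions.
mca race occupation industry, dimensions(3)

* Row scores: one coordinate per observation and dimension
* (assumed syntax; see -help mca postestimation-).
predict double dim1 dim2 dim3, rowscores

* Outcome variance explained by the retained dimensions ...
regress wage dim1 dim2 dim3
display "R-squared, MCA scores:  " e(r2)

* ... versus the original categorical variables entered as dummies.
regress wage i.race i.occupation i.industry
display "R-squared, full dummies: " e(r2)
```

A large gap between the two R-squared values would suggest that the retained dimensions discard outcome-relevant information, which is the concern raised above.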
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#4

09 Oct 2015, 04:23

Federico:
with so many predictors there is nothing you can do other than try to reduce the complexity.
Does the literature in your research field support the 1st, 4th and 9th dimensions as far as your research goal is concerned?

Kind regards,
Carlo
(Stata 19.0)
Federico Tedeschi

Join Date: Mar 2015

Posts: 139
#5

09 Oct 2015, 04:32

Thanks for your reply. I agree that I have to reduce complexity: the problem is how to do it without losing information.

Maybe there was a misunderstanding: when I talked about 1st, 4th and 9th dimensions I meant the equivalent of the factors for PCA, i.e. the variables created by MCA, thus without any interpretation. I talked about them just to say that, if I had taken the first 3 dimensions from MCA, I'd have missed a lot of information relevant to the outcome. As for the variables in the 1st block, they include sociodemographic variables that are commonly controlled for in the literature, as well as information on the working role and history that must be controlled for since the outcome is about problems that do not affect all workers in the same way.
I forgot to add that most of my confounders are ordinal, so I cannot test for linearity of the effects to make them numerical.

Last edited by Federico Tedeschi; 09 Oct 2015, 04:38.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#6

09 Oct 2015, 04:39

Federico:
no, you were clear. I replicated the same names for the dimensions you mentioned hoping they were meaningful for you.
I would take a look at -pca postestimation-, especially the Example section, hoping it can give you some guidance.

Kind regards,
Carlo
(Stata 19.0)
Federico Tedeschi

Join Date: Mar 2015

Posts: 139
#7

09 Oct 2015, 07:15

Thank you Carlo.
I'll take a look at -mca postestimation- and ask a new question on multiple correspondence analysis if needed.

Federico