Dear Statalisters,
I'm using Stata 11.2. I have a problem with the dimensionality of a regression, especially with respect to categorical variables. Details follow.
I need to run a hierarchical regression, but I have a problem with the variables in the first block: there are too many parameters to estimate (22), mostly because of categorical variables. That is a lot for my small dataset (N=423), given that the second block also has many predictors (14, some of them categorical as well) and that there are many missing values.
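To make the setup concrete, here is a minimal sketch of what I mean by entering the blocks hierarchically. It uses the auto.dta example dataset as a stand-in for my data, so the variable roles (price as outcome, rep78 and foreign as first block, weight and length as second block) are purely hypothetical.

```stata
* Sketch only: auto.dta stands in for my data; variable roles are hypothetical
sysuse auto, clear

* Block 1 only: the covariates I want to partial out (rep78 is categorical)
regress price i.rep78 foreign

* Blocks 1 and 2 together
regress price i.rep78 foreign weight length

* Joint test of the second-block predictors, given the first block
test weight length
```

In my real data the first regression alone already uses 22 parameters.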
My feeling is that if I include all the first-block variables, I inflate the standard errors too much and so hide relevant predictors in the second block. But if I run a stepwise regression on the first block, I don't control for confounders enough, because most of them are dropped as non-significant. Finally, if I looked for the best subset of first-block predictors, the results would depend on the choice of reference categories.
I've found a paper and some online discussions, about other software but not about Stata, on how to reduce the dimensionality of categorical variables for a regression. I've tried the "mca" command. If I understand the Stata manual and help correctly, the number of dimensions in multiple correspondence analysis should equal the number of parameters; nevertheless, it creates a number of factors equal to the number of categorical variables. That does reduce dimensionality, but I have no idea how much information I'm losing this way, given that the factors are built without taking the outcome into account.
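Roughly, what I tried looks like the sketch below. Again auto.dta is only a stand-in, the variable names (origin, mca_d1, mca_d2) are made up for illustration, and the choice of two dimensions is an assumption for the example rather than a recommendation.

```stata
* Sketch only: auto.dta stands in for my data; names are hypothetical
sysuse auto, clear

* Recode the 0/1 indicator to 1/2 so every category code is a positive integer
generate byte origin = foreign + 1

* MCA on the first-block categorical variables, keeping two dimensions
mca rep78 origin, dimensions(2)

* Row scores on the two dimensions, one value per observation
predict double mca_d1 mca_d2, rowscores dimensions(1 2)

* The scores replace the first-block dummies in the final regression
regress price mca_d1 mca_d2 weight length
```

The scores are computed from the categorical variables alone, which is why I can't tell how much outcome-relevant information is lost.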
I wondered whether a two-stage regression could help (as is usually done for instrumental variables, although with a different goal here): first predict the outcome from the first block, then use the fitted values in the final regression together with the second-block variables. However, I haven't found this approach anywhere (at least for continuous outcomes).
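In Stata terms, the two-stage idea would be something like the sketch below. As before, auto.dta and the variable roles are hypothetical stand-ins, and block1_hat is a name I made up. Whether the second-stage standard errors can be trusted, given that block1_hat is itself estimated, is part of what I'm asking.

```stata
* Sketch only: auto.dta stands in for my data; variable roles are hypothetical
sysuse auto, clear

* Stage 1: regress the outcome on the first-block covariates only
regress price i.rep78 foreign
predict double block1_hat if e(sample), xb

* Stage 2: the whole first block enters as one fitted-value term
regress price block1_hat weight length

* For comparison: the full model with both blocks entered directly
regress price i.rep78 foreign weight length
```

The second stage spends a single parameter on the first block instead of one per dummy, which is the saving I'm after.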
I feel there should be a solution, given that I'm not interested in explaining anything or exploring any relationship, but just in partialling out the effect of a bunch of covariates without inflating the standard errors of the other parameters.
Thanks for your attention.
