Hi,
I am using a census dataset and ran OLS regressions with a binary dependent variable ("employed"), a binary explanatory variable (yob_1919_age; equal to one if the respondent was born in 1919), and two continuous explanatory variables (yob_age and yob2_age; these are the year of birth and the year of birth squared; please ignore the "_age" part of the variable names, which doesn't mean anything).
I found that the standard errors and p-values differ quite a lot depending on whether the dataset is sorted by age in ascending or in descending order. Please see a screenshot of the regression output below. This only happens when using the option "robust".
The F statistic is missing, which could indicate that there is something wrong with the model. I limited the regression to those born from 1912 to 1919, and my guess is that the combination of a 1919 dummy and a binary outcome means that there is insufficient variation in one of the cells. However, I am still very surprised that the result depends on how the data is sorted (rather than the output simply showing dots for the standard errors). I'd be thankful for any hint as to what may be causing this.
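In case it helps, here is a minimal sketch of the comparison described above. The sort variable name "age" and the stored-estimate names "asc" and "desc" are placeholders assumed for illustration; the regression variables are the ones named in this post.

```stata
* Sketch only: "age" (sort variable) and "asc"/"desc" (estimate names) are assumed names.

* Ascending sort, robust standard errors, sample limited to births 1912-1919
sort age
regress employed yob_1919_age yob_age yob2_age if inrange(yob_age, 1912, 1919), vce(robust)
estimates store asc

* Descending sort, same model and sample
gsort -age
regress employed yob_1919_age yob_age yob2_age if inrange(yob_age, 1912, 1919), vce(robust)
estimates store desc

* Coefficients and standard errors from both runs side by side
estimates table asc desc, b(%12.0g) se(%12.0g)
```

Dropping vce(robust) from both regress calls gives the non-robust runs for comparison.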
I'm using Stata 16.1 on a MacBook, and I could also replicate the problem on a Windows PC.
Best regards,
Christian
