I am new to Stata, and for now it is a supplementary tool for me. I find it well documented and ideal for some tasks.
But I cannot fully understand its mechanism for treating categorical variables.
This is best described by example. Suppose I have the following dataset:
+---------------------+
| category contin~s |
|---------------------|
1. | Cat A 5 |
2. | Cat B 25 |
3. | Cat C 34 |
4. | Cat A 6 |
5. | Cat B 22 |
|---------------------|
6. | Cat C 35 |
This is very simple data with two variables, one categorical and one continuous. Assume one wants to run a linear regression on it ('continuous' is the DV, 'category' is the IV). Following Stata's basic workflow, we first declare 'category' as a labeled numeric variable so that we can run regressions etc. on it:
encode category, generate(category_i)
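For reference, encode assigns integer codes following the sort order of the distinct string values (so here Cat A = 1, Cat B = 2, Cat C = 3) and records the mapping in a value label named after the new variable, which can be inspected with:

* show the integer-to-label mapping that encode created
label list category_i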
Then:
regress continuous i.category_i
And we have our result. Assume we are satisfied with the model and want to save it for future predictions (a natural thing to do):
estimates save linearmod1
Everything looks fine so far. Then we get new data on which we want to predict the variable 'continuous'; it looks as follows:
+----------+
| category |
|----------|
1. | Cat A |
2. | Cat A |
3. | Cat C |
+----------+
Now the crucial step: to predict on this data, we must transform 'category' into a labeled numeric variable again. After doing that (using the same name for the labeled numeric variable, 'category_i'), we generate predictions and store them in the variable 'prediction':
encode category, generate(category_i)
estimates use linearmod1
predict prediction
+--------------------------------+
| category catego~i predic~n |
|--------------------------------|
1. | Cat A Cat A 5.5 |
2. | Cat A Cat A 5.5 |
3. |    Cat C      Cat C       23.5 |
+--------------------------------+
And we get an obviously wrong prediction for category C!
The reason is simple: Cat C was encoded as integer 3 in the first dataset, but in the new dataset, where there is no Cat B, it is encoded as integer 2. The model is effectively generating predictions for Cat B, not for Cat C.
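The mismatch can be seen directly by listing the value label that encode created in the new dataset:

* run after encode in the new dataset
label list category_i
* shows 1 = "Cat A", 2 = "Cat C", whereas the saved model
* expects 1 = "Cat A", 2 = "Cat B", 3 = "Cat C"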
How can one resolve this? Of course, in this simple case we can manually change the labels and enter integer 3. But in the real world it would be a nightmare to check that every new dataset was encoded with dummy values consistent with our saved models. How can I establish a reliable workflow for 'manage training data' -> 'build a model' -> 'store model' -> 'generate predictions on new data, where some categories of a categorical variable may be absent', without manually checking the encoding of categorical variables?
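For concreteness, the manual fix I mean would look something like this in the toy case: define the full training-time mapping up front and tell encode to use it via its label() option ('category_lbl' is just a name I picked):

* pin the codes before encoding the new data
label define category_lbl 1 "Cat A" 2 "Cat B" 3 "Cat C"
encode category, generate(category_i) label(category_lbl)

With the label supplied, Cat C is coded as 3 even though Cat B is absent from the new data. But maintaining such label definitions by hand for every variable and every new dataset is exactly what I want to avoid.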
I hope my question is clear. Any questions and feedback are highly appreciated.