I am new to Stata, and for now it is a supplementary tool for me. I find it well documented and ideal for some tasks.
But I cannot fully understand its mechanism for treating categorical variables.
This is best described by example. Suppose I have the following dataset:
+---------------------+
| category contin~s |
|---------------------|
1. | Cat A 5 |
2. | Cat B 25 |
3. | Cat C 34 |
4. | Cat A 6 |
5. | Cat B 22 |
|---------------------|
6. | Cat C 35 |
This is very simple data with two variables, one categorical and one continuous. Assume one wants to run a linear regression on it ('continuous' is the DV, 'category' is the IV). Following Stata's basic workflow, we first declare 'category' as a labeled numeric variable so that we can run regressions etc. on it:
encode category, generate(category_i)
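For reference, encode assigns integer codes following the sort order of the distinct string values (so here Cat A = 1, Cat B = 2, Cat C = 3) and records the mapping in a value label named after the new variable, which can be inspected with:

* show the integer-to-label mapping that encode created
label list category_i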
Then:
regress continuous i.category_i
And we have our result. Assume we are satisfied with the model and want to save it for future predictions (a natural thing to do):
estimates save linearmod1
Everything looks fine so far. Then we get new data on which we want to predict the variable 'continuous'; it looks as follows:
+----------+
| category |
|----------|
1. | Cat A |
2. | Cat A |
3. | Cat C |
+----------+
Now the crucial step: to predict on this data, we must transform 'category' into a labeled numeric variable again. After doing that (using the same name for the labeled numeric variable, 'category_i'), we generate predictions and store them in the variable 'prediction':
encode category, generate(category_i)
estimates use linearmod1
predict prediction
+--------------------------------+
| category catego~i predic~n |
|--------------------------------|
1. | Cat A Cat A 5.5 |
2. | Cat A Cat A 5.5 |
3. |    Cat C      Cat C       23.5 |
+--------------------------------+
And we get an obviously wrong prediction for category C!
The reason is simple: Cat C was encoded as integer 3 in the first dataset, but in the new dataset, where there is no Cat B, it is encoded as integer 2. The model is effectively generating predictions for Cat B, not for Cat C.
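The mismatch can be seen directly by listing the value label that encode created in the new dataset:

* run after encode in the new dataset
label list category_i
* shows 1 = "Cat A", 2 = "Cat C", whereas the saved model
* expects 1 = "Cat A", 2 = "Cat B", 3 = "Cat C"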
How can one resolve this? Of course, in this simple case we can manually change the labels and enter integer 3. But in the real world it would be a nightmare to check that every new dataset was encoded with dummy values consistent with our saved models. How can I establish a reliable workflow for 'manage training data' -> 'build a model' -> 'store model' -> 'generate predictions on new data, where some categories of a categorical variable may be absent', without manually checking the encoding of categorical variables?
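For concreteness, the manual fix I mean would look something like this in the toy case: define the full training-time mapping up front and tell encode to use it via its label() option ('category_lbl' is just a name I picked):

* pin the codes before encoding the new data
label define category_lbl 1 "Cat A" 2 "Cat B" 3 "Cat C"
encode category, generate(category_i) label(category_lbl)

With the label supplied, Cat C is coded as 3 even though Cat B is absent from the new data. But maintaining such label definitions by hand for every variable and every new dataset is exactly what I want to avoid.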
I hope my question is clear. Any questions and feedback are highly appreciated.