Using encoded variables

lucas reddinger

Join Date: May 2021
Posts: 26

13 Sep 2024, 12:32

Is there a nice way to refer to a value label's text in an expression, instead of the underlying numeric value?

For example, in MariaDB, here I have a variable of type enum, which is similar to how Stata encodes a categorical variable:

Code:

> describe choices type;
+-------+-------------------------------------------------------------------------------+------+-----+---------+-------+
| Field | Type                                                                          | Null | Key | Default | Extra |
+-------+-------------------------------------------------------------------------------+------+-----+---------+-------+
| type  | enum('instructions','qualify','belief','gamble','commit','win','lose','exit') | NO   |     | NULL    |       |
+-------+-------------------------------------------------------------------------------+------+-----+---------+-------+

If I want to refer to a specific encoded value, I simply do so:

Code:

> select avg(value) from choices where type='belief';
+------------+
| avg(value) |
+------------+
|    18.7899 |
+------------+

I could also use the underlying numeric index itself:

Code:

> select avg(value) from choices where type=3;
+------------+
| avg(value) |
+------------+
|    18.7899 |
+------------+

As you probably know, the former doesn't work in Stata:

Code:

. use https://www.stata-press.com/data/r18/hbp3, clear

. reg hbp year if age_grp=="20-24"
type mismatch
r(109);

. label list
sexlbl:
           0 Male
           1 Female
yn:
           0 No
           1 Yes
racefmt:
           1 White
           2 Black
           3 Hispanic
           4 Asian
           5 Am India
           8 Other
agefmt:
           1 Under 15
           2 15–19
           3 20–24
           4 25–29
           5 30–34
           6 35–39
           7 40–44
           8 45+

. reg hbp year if age_grp==3

      Source |       SS           df       MS      Number of obs   =       407
-------------+----------------------------------   F(1, 405)       =      2.99
       Model |  .085452698         1  .085452698   Prob > F        =    0.0844
    Residual |  11.5607389       405  .028545034   R-squared       =    0.0073
-------------+----------------------------------   Adj R-squared   =    0.0049
       Total |  11.6461916       406  .028685201   Root MSE        =    .16895

------------------------------------------------------------------------------
         hbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        year |  -.0105937   .0061228    -1.73   0.084    -.0226302    .0014427
       _cons |   21.12433   12.19211     1.73   0.084     -2.84339    45.09204
------------------------------------------------------------------------------

Is there an automatic way to do this? Maybe something like this:

Code:

. reg hbp year if test_label(age_grp,'20-24')

I've long adhered to efficient data encoding practices. But storage and memory are now extremely cheap relative to my time costs. As a result, I'm leaning toward simply leaving categorical variables as strings, resulting in more readable code.

But I'd love to hear of a solution that balances efficient encoding with code readability.

Cheers!


daniel klein

Join Date: Mar 2014

Posts: 3845
#2

13 Sep 2024, 13:00

Originally posted by lucas reddinger

As you probably know, the former doesn't work in Stata:

I beg to differ. Try

Code:

regress hbp year if age_grp=="20–24":agefmt

and read more on that in Higbee (2004).

Note that the "–" (U+2013) in "20–24" is not the same as "-" (U+002D).
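One way to check which number a label text maps to, before using it in an if qualifier, is to display the expression on its own. This is a minimal sketch, assuming the hbp3 dataset and the agefmt label as listed earlier in the thread. The behavior on a failed lookup is an assumption to verify in your own Stata version.

```stata
use https://www.stata-press.com/data/r18/hbp3, clear

* En dash (U+2013), matching the label text stored in agefmt
display "20–24":agefmt

* Hyphen-minus (U+002D): agefmt holds no such text, so this should
* evaluate to missing (.) rather than raise an error (assumption: verify)
display "20-24":agefmt
```

If a failed lookup does evaluate to missing, a mistyped label in an if qualifier would silently select the observations with a missing value instead of stopping with an error, so a quick display beforehand is cheap insurance.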

Higbee, K. 2004. Stata tip 14: Using value labels in expressions. The Stata Journal, 4(4), 486–487.
daniel klein

Join Date: Mar 2014

Posts: 3845
#3

13 Sep 2024, 13:13

A quick follow-up.

Originally posted by lucas reddinger

Is there an automatic way to do this? Maybe something like this:

You can get a value label name of a numeric(!) variable via

Code:

local labelname : value label varname

Combining this with the approach I have shown above, you get

Code:

regress hbp year if age_grp=="20–24":`:value label age_grp'

or, because you prefer more readable code,

Code:

local labelname : value label age_grp
regress hbp year if age_grp=="20–24":`labelname'

(neither code is tested).

Originally posted by lucas reddinger

I've long adhered to efficient data encoding practices. But [...] I'm leaning toward simply leaving categorical variables as strings

Note that a numeric variable with a value label is still a numeric variable, not a string variable. You may decode a numeric variable with a value label to create a string variable. You could then refer to the string values directly as in

Code:

decode age_grp , generate(string_age_grp)
regress hbp year if string_age_grp=="20–24"

Note, however, that you cannot use string variables as predictors (or outcomes) in any estimation command.
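If the decoded string copy later has to enter a model as a categorical predictor, it can be turned back into a labeled numeric variable. The sketch below assumes the hbp3 dataset from the thread; age_grp2 is a hypothetical variable name used only for illustration. Without the label() option, encode assigns codes in alphabetical order of the strings, which need not match the original coding.

```stata
use https://www.stata-press.com/data/r18/hbp3, clear

* String copy of the labeled numeric variable
decode age_grp, generate(string_age_grp)

* label(agefmt) asks encode to reuse the existing value label, so the
* numeric codes follow agefmt instead of alphabetical order
encode string_age_grp, generate(age_grp2) label(agefmt)

* Sanity check: wherever the round trip produced a code, it matches
assert age_grp2 == age_grp if !missing(age_grp2)

* The numeric version can be used as a factor variable
regress hbp year i.age_grp2
```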

On the balance between storage space and readable code: Note that Stata's value labels may be attached to multiple variables. Thus, the strings (i.e., alphanumeric characters, which consume storage space) are only saved once, while all variables that use these strings are still more efficiently stored in numeric formats.
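Sharing one value label across several variables could look like the sketch below. The data and the variable names smoker and insured are made up for illustration; only the label name yn echoes the label list shown earlier in the thread.

```stata
clear
set obs 4

* Two hypothetical 0/1 indicators stored as bytes
generate byte smoker  = mod(_n, 2)
generate byte insured = _n > 2

* Define the label once and attach it to both variables
label define yn 0 "No" 1 "Yes"
label values smoker insured yn

* The same label text works in expressions for either variable
count if smoker == "Yes":yn
count if insured == "Yes":yn
```

The strings "No" and "Yes" are stored once in the label definition, while each variable takes one byte per observation.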
lucas reddinger

Join Date: May 2021

Posts: 26
#4

13 Sep 2024, 13:34

Thanks, that's exactly what I was looking for!

Good catch on the Unicode en-dash as well.

Thanks also for your follow-up.

Best,
Lucas