Is there a nice way to refer to a categorical value by its value label (the decoded text) rather than by its underlying numeric code?
For example, in MariaDB, here I have a variable of type enum, which is similar to how Stata encodes a categorical variable:
Code:
> describe choices type;
+-------+-------------------------------------------------------------------------------+------+-----+---------+-------+
| Field | Type                                                                          | Null | Key | Default | Extra |
+-------+-------------------------------------------------------------------------------+------+-----+---------+-------+
| type  | enum('instructions','qualify','belief','gamble','commit','win','lose','exit') | NO   |     | NULL    |       |
+-------+-------------------------------------------------------------------------------+------+-----+---------+-------+
If I want to refer to a specific value by its label, I simply do so:
Code:
> select avg(value) from choices where type='belief';
+------------+
| avg(value) |
+------------+
|    18.7899 |
+------------+
I could also use the underlying integer index itself:
Code:
> select avg(value) from choices where type=3;
+------------+
| avg(value) |
+------------+
|    18.7899 |
+------------+
As you probably know, the former approach (comparing against the label text) doesn't work in Stata, while the latter does:
Code:
. use https://www.stata-press.com/data/r18/hbp3, clear

. reg hbp year if age_grp=="20-24"
type mismatch
r(109);

. label list
sexlbl:
           0 Male
           1 Female
yn:
           0 No
           1 Yes
racefmt:
           1 White
           2 Black
           3 Hispanic
           4 Asian
           5 Am India
           8 Other
agefmt:
           1 Under 15
           2 15–19
           3 20–24
           4 25–29
           5 30–34
           6 35–39
           7 40–44
           8 45+

. reg hbp year if age_grp==3

      Source |       SS           df       MS      Number of obs   =       407
-------------+----------------------------------   F(1, 405)       =      2.99
       Model |  .085452698         1  .085452698   Prob > F        =    0.0844
    Residual |  11.5607389       405  .028545034   R-squared       =    0.0073
-------------+----------------------------------   Adj R-squared   =    0.0049
       Total |  11.6461916       406  .028685201   Root MSE        =    .16895

------------------------------------------------------------------------------
         hbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        year |  -.0105937   .0061228    -1.73   0.084    -.0226302    .0014427
       _cons |   21.12433   12.19211     1.73   0.084     -2.84339    45.09204
------------------------------------------------------------------------------
Is there an automatic way to do this? Maybe something like this:
Code:
. reg hbp year if test_label(age_grp,'20-24')
I've long adhered to efficient data encoding practices. But storage and memory are now extremely cheap relative to my time costs. As a result, I'm leaning toward simply leaving categorical variables as strings, resulting in more readable code.
But I'd love to hear of a solution that balances efficient encoding with code readability.
Cheers!