  • Using encoded variables

    Is there a nice way to refer to a value label (the encoded value) rather than the underlying numeric code (the decoded value)?

    For example, in MariaDB, I have a variable of type enum, which is similar to how Stata encodes a categorical variable:

    Code:
    > describe choices type;
    +-------+-------------------------------------------------------------------------------+------+-----+---------+-------+
    | Field | Type                                                                          | Null | Key | Default | Extra |
    +-------+-------------------------------------------------------------------------------+------+-----+---------+-------+
    | type  | enum('instructions','qualify','belief','gamble','commit','win','lose','exit') | NO   |     | NULL    |       |
    +-------+-------------------------------------------------------------------------------+------+-----+---------+-------+

    If I want to refer to a specific encoded value, I can simply write its label:

    Code:
    > select avg(value) from choices where type='belief';
    +------------+
    | avg(value) |
    +------------+
    |    18.7899 |
    +------------+

    I could also use the decoded value (the underlying integer) itself:

    Code:
    > select avg(value) from choices where type=3;
    +------------+
    | avg(value) |
    +------------+
    |    18.7899 |
    +------------+

    As you probably know, the former doesn't work in Stata:

    Code:
    . use https://www.stata-press.com/data/r18/hbp3, clear
    
    . reg hbp year if age_grp=="20-24"
    type mismatch
    r(109);
    
    . label list
    sexlbl:
               0 Male
               1 Female
    yn:
               0 No
               1 Yes
    racefmt:
               1 White
               2 Black
               3 Hispanic
               4 Asian
               5 Am India
               8 Other
    agefmt:
               1 Under 15
               2 15–19
               3 20–24
               4 25–29
               5 30–34
               6 35–39
               7 40–44
               8 45+
    
    . reg hbp year if age_grp==3
    
          Source |       SS           df       MS      Number of obs   =       407
    -------------+----------------------------------   F(1, 405)       =      2.99
           Model |  .085452698         1  .085452698   Prob > F        =    0.0844
        Residual |  11.5607389       405  .028545034   R-squared       =    0.0073
    -------------+----------------------------------   Adj R-squared   =    0.0049
           Total |  11.6461916       406  .028685201   Root MSE        =    .16895
    
    ------------------------------------------------------------------------------
             hbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            year |  -.0105937   .0061228    -1.73   0.084    -.0226302    .0014427
           _cons |   21.12433   12.19211     1.73   0.084     -2.84339    45.09204
    ------------------------------------------------------------------------------

    Is there an automatic way to do this? Maybe something like this:

    Code:
    . reg hbp year if test_label(age_grp,'20-24')

    I've long adhered to efficient data encoding practices. But storage and memory are now extremely cheap relative to my time costs. As a result, I'm leaning toward simply leaving categorical variables as strings, resulting in more readable code.

    But I'd love to hear of a solution that balances efficient encoding with code readability.

    Cheers!

  • #2
    Originally posted by lucas reddinger:
    As you probably know, the former doesn't work in Stata:
    I beg to differ. Try

    Code:
    regress hbp year if age_grp=="20–24":agefmt
    and read more on that in Higbee (2004).

    Note that the "–" (U+2013) in "20–24" is not the same as "-" (U+002D).
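
    To see what such an expression resolves to before putting it in an if qualifier, display can evaluate it directly. This is a minimal, untested sketch that assumes the hbp3 dataset and the agefmt label shown in the label list output above; it also assumes that a label text not found in the value label evaluates to missing rather than raising an error.

    Code:
    use https://www.stata-press.com/data/r18/hbp3, clear
    * en-dash (U+2013), as stored in agefmt: should show the numeric code 3
    display "20–24":agefmt
    * hyphen (U+002D): no such label text, so this is assumed to show missing (.)
    display "20-24":agefmt
    * under that assumption, this compares age_grp with missing instead of failing
    count if age_grp == "20-24":agefmt

    If the assumption holds, a mistyped label does not stop the command, so checking the expression with display first is a cheap safeguard.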


    Higbee, K. 2004. Stata tip 14: Using value labels in expressions. The Stata Journal, 4(4), 486-487.


    • #3
      A quick follow-up.

      Originally posted by lucas reddinger:
      Is there an automatic way to do this? Maybe something like this:
      You can get the name of the value label attached to a numeric(!) variable via

      Code:
      local labelname : value label varname
      Combining this with the approach I have shown above, you get

      Code:
      regress hbp year if age_grp=="20–24":`:value label age_grp'
      or, because you prefer more readable code,

      Code:
      local labelname : value label age_grp
      regress hbp year if age_grp=="20–24":`labelname'
      (neither code is tested).


      Originally posted by lucas reddinger:
      I've long adhered to efficient data encoding practices. But [...] I'm leaning toward simply leaving categorical variables as strings
      Note that a numeric variable with a value label is still a numeric variable, not a string variable. You may decode a numeric variable with a value label to create a string variable. You could then refer to the string values directly as in

      Code:
      decode age_grp , generate(string_age_grp)
      regress hbp year if string_age_grp=="20–24"
      Note, however, that you cannot use string variables as predictors (or outcomes) in any estimation command.
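
      If a string copy is only needed temporarily, encode can map it back to numeric codes. This is a minimal, untested sketch that assumes the string_age_grp variable created by the decode call above; age_grp2 is a hypothetical variable name, and the label(agefmt) option asks encode to reuse the existing mapping rather than number the categories alphabetically.

      Code:
      * rebuild a labeled numeric variable from the decoded strings
      encode string_age_grp, generate(age_grp2) label(agefmt)
      * should pass, assuming every nonmissing value of age_grp carries a label
      assert age_grp == age_grp2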

      On the balance between storage space and readable code: Note that Stata's value labels may be attached to multiple variables. Thus, the strings (i.e., alphanumeric characters, which consume storage space) are only saved once, while all variables that use these strings are still more efficiently stored in numeric formats.
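
      A minimal, untested sketch of one value label shared by two variables, using made-up data; q1, q2, and yesno are hypothetical names chosen for illustration:

      Code:
      clear
      set obs 4
      * two small numeric variables stored as bytes
      generate byte q1 = mod(_n, 2)
      generate byte q2 = _n > 2
      * the label text is defined once ...
      label define yesno 0 "No" 1 "Yes"
      * ... and attached to both variables
      label values q1 q2 yesno
      * either variable can then be queried by its label text
      count if q1 == "Yes":yesno & q2 == "No":yesno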


      • #4
        Originally posted by daniel klein:

        I beg to differ. Try

        Code:
        regress hbp year if age_grp=="20–24":agefmt
        and read more on that in Higbee (2004).

        Note that the "–" (U+2013) in "20–24" is not the same as "-" (U+002D).


        Higbee, K. 2004. Stata tip 14: Using value labels in expressions. The Stata Journal, 4(4), 486-487.

        Thanks, that's exactly what I was looking for!

        Good catch on the Unicode en-dash as well.

        Thanks also for your follow-up.

        Best,
        Lucas
