  • Encode or sencode with specific reserved value labels

    Hi Statalist

    I am encoding a list of categorical variables that are common to four datafiles, one file per city. In each city the items of the list are not always exactly the same, but there are certain items such as "refused" that recur. So there is partial intersection if we think in terms of Venn diagrams.

    Sencode is great at encoding and also creating a label that I can then apply to the same var in the next city, so that common values get the same value label. My problem is that I want to "reserve"/define some particular values myself. Specifically, -3 "refused", -2 "do not know," -1 "missing", and the most important one, 9 "other(specify)". The 9 is the problematic value because, for whatever reason, once I add it to the label, sencode then skips all digits below 9 when adding to that label. As far as I can tell this contradicts the manual, which says it goes from 1 to x without exception. For example, I have a sex variable, and because of this problem it ends up encoded as 9 Other(specify), 10 Female and 11 Male. I want sencode to start from 1 Female, and so on.

    Of course I can make other(Specify) -4 and this solves the problem, but it would be much neater if the 9 was kept as this matches the questionnaire. I have thought up a few ugly solutions to this, but it feels like there should be something simpler. What am I missing and what can I try?

    Code:
    foreach var of local encode_list {
    label define `var'_label -3 "Refused" -2 "Do not know" -3 "Missing" 9 "Other (Specify)"
    sencode `var', label(`var'_label) replace
    }
    Thanks very much in advance for any advice!
    Last edited by Bruce McDougall; 13 Oct 2020, 02:52.

  • #2
    I would use a command like tabm from tab_chi (SSC) to get a combined list of the replies, work out all your desired labels and then issue a loop around an encode command (or look at multencode from SSC).

    The mention of multiple datafiles prompts a warning -- I guess it should be obvious -- that Stata won't know and won't care about label definitions in other datasets. The only simple (relatively simple!) and safe strategy is to get your data together in one place and then encode using identical value labels. The spirit and purpose of consistent labelling is likely to be defeated if you use a different set of value labels for each variable, which is what your code is doing.
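
    A minimal sketch of that strategy, using made-up toy data in place of two of the city files. The variable names sex, city and sex_n and the label name sex_label are hypothetical. It is meant to be run as a do-file, so that the tempfile macro stays defined throughout:

    ```stata
    // toy stand-in for the first city file (hypothetical data)
    clear
    input str15 sex
    "Female"
    "Male"
    "Refused"
    end
    generate byte city = 1
    tempfile city1
    save `city1'

    // toy stand-in for the second city file (hypothetical data)
    clear
    input str15 sex
    "Male"
    "Other (Specify)"
    end
    generate byte city = 2

    // stack the files first, then encode once so all cities share one label
    append using `city1'
    encode sex, generate(sex_n) label(sex_label)

    list, nolabel
    label list sex_label
    ```

    Because encode runs once on the combined data, each category gets a single integer value regardless of which city it came from.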



    • #3
      Hi Nick

      I will have a look at tabm and tab_chi, but multencode won't work. I failed to mention that I label save after the first datafile, then run the resulting .do to have the labels in the subsequent files.
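
      For readers unfamiliar with this workflow, here is a rough sketch with toy data. The file name sex_labels.do, the variable names and the label name are all hypothetical, and the file is written to the current working directory:

      ```stata
      // first city file (hypothetical data): encode and save the label definition
      clear
      input str6 sex
      "Female"
      "Male"
      end
      encode sex, generate(sex_n) label(sex_label)
      label save sex_label using sex_labels.do, replace

      // second city file (hypothetical data): load the saved label, then encode
      clear
      input str6 sex
      "Male"
      "Female"
      "Other"
      end
      do sex_labels.do
      encode sex, generate(sex_n) label(sex_label)

      list, nolabel
      label list sex_label
      ```

      Strings already in the label keep their values, and encode numbers any new string from the label's current maximum plus one. That is also why a pre-defined 9 pushes new categories to 10 and above.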

      But the crux of my issue occurs in the first data file, so let's limit the conversation to this file. Maybe I wasn't clear - this is a large set of differing categorical vars that need value labels. Sex, previous occupation, access to electricity, and so on. These naturally have different values but what is common is that they all include other(specify) as an answer. I want them all to have a numlabel of 9 to represent this "Other(specify)". Manually typing these all out is an option but is very painful.

      What would be great is an option to (/s)encode, where it creates val labels for distinct values (groups) sequentially from 1, but leaving/jumping 9 as an exception that is pre-defined to represent "other(specify)". Apologies, I may be talking past you a bit here.

      Regards,
      Bruce





      • #4
        There are questions on different levels here, which is fine, but let's distinguish some answers.

        An extra option for encode is for StataCorp and that's a longer-term issue.

        An extra option for sencode (SSC, or Roger Newson's sites) is for him, Roger Newson, and if you convince him an extra hook will come quickly.

        Then there's what you do now. I find the assertion that multencode will not work puzzling without explanation; say rather that it appears irrelevant to what you want to do. I don't have any extra advice, unfortunately.



        • #5
          Hi Nick

          Thanks for your input. Yes, better to say that multencode felt less relevant because (thinking more clearly) my issue is really at the level of a single variable. Specifically, I need a way to automatically expand a variable's existing value labels using values below the highest number already in the label, so that new values are interspersed between existing ones.

          For now I am going to use one of my ugly solutions, which is to text-process the lists using regex and then paste the result into my .do. This will create a very big, messy, but functional .do.

          I would be interested to see what Roger thinks.

          Thanks again,
          Bruce



          • #6
            It is hard to give specific advice without example data to play around with. What I can tell already is that

            Code:
            label define `var'_label -3 "Refused" -2 "Do not know" -3 "Missing" 9 "Other (Specify)"
            mentions the value -3 twice; this is most likely not what you want.

            If everything else fails, you can easily change labels with elabel (SSC) after encode has defined them. Here is an example:

            Code:
            // some toy data
            clear
            input str14 s
            "foo"
            "bar"
            "other(specify)"
            "z"
            "missing"
            end
            
            // this is the dataset
            list
            
            // define your label
            label define mylabel -3 "missing" 9 "other(specify)"
            
            // now encode your variable
            encode s , generate(n) label(mylabel) 
            
            // this is the new dataset
            list
            list , nolabel
            
            // ... and your label
            label list
            
            // now fix the value label (and the variables that have it attached)
            *ssc install elabel
            elabel recode mylabel (10/17 = 1/8) , recodevarlist
            
            // the final dataset
            list
            list , nolabel
            
            // and the (fixed) label
            label list

            Here is the relevant output

            Code:
            . // this is the dataset
            . list
            
                 +----------------+
                 |              s |
                 |----------------|
              1. |            foo |
              2. |            bar |
              3. | other(specify) |
              4. |              z |
              5. |        missing |
                 +----------------+
            
            [...]
            
            . // this is the new dataset
            . list
            
                 +---------------------------------+
                 |              s                n |
                 |---------------------------------|
              1. |            foo              foo |
              2. |            bar              bar |
              3. | other(specify)   other(specify) |
              4. |              z                z |
              5. |        missing          missing |
                 +---------------------------------+
            
            . list , nolabel
            
                 +---------------------+
                 |              s    n |
                 |---------------------|
              1. |            foo   11 |
              2. |            bar   10 |
              3. | other(specify)    9 |
              4. |              z   12 |
              5. |        missing   -3 |
                 +---------------------+
            
            . // ... and your label
            . label list
            mylabel:
                      -3 missing
                       9 other(specify)
                      10 bar
                      11 foo
                      12 z
            
            . // now fix the value label (and the variables that have it attached)
            . *ssc install elabel
            . elabel recode mylabel (10/17 = 1/8) , recodevarlist
            (n: 3 changes made)
            
            . // the final dataset
            . list
            
                 +---------------------------------+
                 |              s                n |
                 |---------------------------------|
              1. |            foo              foo |
              2. |            bar              bar |
              3. | other(specify)   other(specify) |
              4. |              z                z |
              5. |        missing          missing |
                 +---------------------------------+
            
            . list , nolabel
            
                 +---------------------+
                 |              s    n |
                 |---------------------|
              1. |            foo    2 |
              2. |            bar    1 |
              3. | other(specify)    9 |
              4. |              z    3 |
              5. |        missing   -3 |
                 +---------------------+
            
            . // and the (fixed) label
            . label list
            mylabel:
                      -3 missing
                       1 bar
                       2 foo
                       3 z
                       9 other(specify)

            You can read more about elabel recode in the respective help-file. Option recodevarlist is not (yet) documented; it recodes all variables that have the respective value label attached.

            Option varlist (which I have not used) is documented and will, erroneously, also recode the respective variables. The latter behavior is a bug that I will fix in an update in a couple of days.


            A final word of caution: If you are doing this for many variables in many datasets, be careful to recode value labels and, more importantly, variables once only. Also, double-check your results.



            • #7
              I could not let this go. Here is a program, encodelabel, that creates labels with integer values starting at 1 (or any other specified minimum value) and skipping any pre-defined values in a value label.

              Code:
              *! version 1.0.0 daniel klein 15oct2020
              program encodelabel
                  version 11.2
                  
                  syntax varname(string) [ if ] [ in ] ///
                  , Generate(name) Label(name) [ MIN(integer 1) ]
                  
                  confirm new variable `generate'
                  
                  marksample touse , strok
                  
                  mata : encodelabel("`varlist'", "`touse'", "`label'", `min')
                  
                  encode `varlist' if `touse' , generate(`generate') label(`label')
              end
              
              version 11.2
              
              mata :
              
              mata set matastrict on
              
              void encodelabel(string scalar vname,
                               string scalar touse,
                               string scalar lname,
                               real   scalar count)
              {
                  real   colvector values
                  string colvector labels
                  string colvector strvar
                  real   scalar    i
              
                  if ( !st_vlexists(lname) ) {
                      errprintf("value label %s not found\n", lname)
                      exit(111)
                  }
                  
                  pragma unset values
                  pragma unset labels
                  
                  st_vlload(lname, values, labels)
                  
                  strvar = uniqrows(st_sdata(., vname, touse))
                  
                  for (i=1; i<=rows(strvar); ++i) {
                      if ( anyof(labels, strvar[i]) ) continue
                      while ( anyof(values, count) ) count++
                      labels = (labels\ strvar[i])
                      values = (values\ count)
                  }
                  
                  st_vlmodify(lname, values, labels)
                  
                  values = select(values, (labels:==""))
                  if ( !rows(values) ) return
                  assert( (rows(values)==1) )
                  (void) _stata(sprintf(`"_label define %s %f "" , modify"', lname, values))
              }
              
              end
              exit

              Here is the program applied to the example in #6

              Code:
              . // this is the dataset
              . list
              
                   +----------------+
                   |              s |
                   |----------------|
                1. |            foo |
                2. |            bar |
                3. | other(specify) |
                4. |              z |
                5. |        missing |
                   +----------------+
              
              . // define your label
              . label define mylabel -3 "missing" 9 "other(specify)"
              
              . // apply -encodelabel-
              . encodelabel s , generate(n) label(mylabel)
              
              . // this is the new dataset
              . list
              
                   +---------------------------------+
                   |              s                n |
                   |---------------------------------|
                1. |            foo              foo |
                2. |            bar              bar |
                3. | other(specify)   other(specify) |
                4. |              z                z |
                5. |        missing          missing |
                   +---------------------------------+
              
              . // ... and your label
              . label list
              mylabel:
                        -3 missing
                         1 bar
                         2 foo
                         3 z
                         9 other(specify)
              Last edited by daniel klein; 15 Oct 2020, 12:30.



              • #8
                Hi Daniel

                Thanks for your interest in this (probably common) challenge.

                w.r.t your first response:
                Yes the -3 was just a typo. The elabel recode would not help much because I would then have to recode all the resulting vars semi-manually, which is what I am trying to get around.

                w.r.t your second response:
                This looks great, thanks a heap. I am curious if the results could also go beyond 9, which I need; I will try it.

                I have not yet mastered using programs but this seems like a good time to do so! I guess I cannot ssc this program, so it must go into the .do I presume. I'll try to get it going and get back to you.

                Thanks again. It seems like a worthy exercise so maybe it can go into ssc? I don't know the rules here.

                Regards,
                Bruce



                • #9
                  Originally posted by Bruce McDougall
                  This looks great, thanks a heap. I am curious if the results could also go beyond 9
                  Yes. The count starts at 1 (or wherever you specify) and is incremented by one for each not yet defined label; any pre-defined values will be skipped.
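
                  As an illustration of that counting rule, here is a small sketch with more than nine categories. It assumes the encodelabel program from #7 has been saved as encodelabel.ado somewhere Stata can find it; the toy strings and the names s, n and mylabel are made up:

                  ```stata
                  // toy data with ten new categories plus the reserved one
                  clear
                  input str14 s
                  "a"
                  "b"
                  "c"
                  "d"
                  "e"
                  "f"
                  "g"
                  "h"
                  "i"
                  "j"
                  "other(specify)"
                  end

                  // reserve 9 before encoding
                  label define mylabel 9 "other(specify)"

                  // assumes encodelabel (the code in #7) is on the ado-path
                  encodelabel s , generate(n) label(mylabel)

                  list , nolabel
                  label list mylabel
                  ```

                  Going by the code in #7, "a" to "h" should receive 1 to 8, "other(specify)" should stay at 9, and "i" and "j" should continue at 10 and 11.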

                  Originally posted by Bruce McDougall
                  I guess I cannot ssc this program, so it must go into the .do I presume.
                  I have created a help file and fixed a bug but I have not yet sent the files to Kit Baum for upload to the SSC. I have attached both the ado and the help file to this post. You can download the files and put them either in the current working directory or into the PLUS directory (which is what ssc install would do). On a Windows machine the PLUS directory is usually c:/ado/plus, and files whose names start with e go into its subfolder c:/ado/plus/e.
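
                  If you are unsure where these directories are on your machine, the following built-in commands may help. This is only a sketch: the last line assumes encodelabel.ado has already been copied into place and will report an error otherwise.

                  ```stata
                  // list Stata's system directories, including PLUS and PERSONAL
                  sysdir

                  // show only the path of the PLUS directory
                  display c(sysdir_plus)

                  // after copying the files: report which encodelabel.ado Stata would run
                  which encodelabel
                  ```

                  If which cannot find the command, the file is not in a directory on the ado-path (or in the current working directory).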


                  Originally posted by Bruce McDougall
                  Thanks again. It seems like a worthy exercise so maybe it can go into ssc? I don't know the rules here.
                  To be honest, I have never come across this problem before, but the desired behavior seems reasonable. There are only a few rules for SSC. The command/package name must not be in use already and a help file is required. I will do a bit more testing and wait for any problem reports from your side and then send the files to Kit Baum, who actually does all the work.


                  • #10
                    Wow, I really didn't expect this level of help. I feel almost guilty to not be paying for all this work!

                    I know how to install this way with the .ado; will do and will keep you posted.

                    Regards,
                    Bruce

                    Comment


                    • #11
                      daniel klein did a great job here but a few extra comments are possible as footnotes. I don't recollect seeing this problem before either and I think there is a good reason for that beyond ignorance and fallible memory.

                      Categories like "Refused" "Do not know" "Missing" I would tend to regard as flavours of missing values on the grounds that sometimes researchers want to see them listed and sometimes they would not. On the other hand a category like "Other (Specify)" I would want to give a high integer to ensure that it comes last by default in a table. So, I might encode all the other categories first and then add special labels for, say, 999, .a, .b and .c for these four cases. Although I am writing about my own inclinations, I doubt that they are totally idiosyncratic.
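
                      One possible reading of this suggestion, sketched with toy data. The variable and label names are hypothetical, and which special category gets which code is an arbitrary choice here:

                      ```stata
                      // toy data (hypothetical)
                      clear
                      input str15 s
                      "Female"
                      "Male"
                      "Other (Specify)"
                      "Refused"
                      "Do not know"
                      "Missing"
                      end

                      // flag the four special categories
                      generate byte special = inlist(s, "Other (Specify)", "Refused", "Do not know", "Missing")

                      // step 1: encode only the substantive categories, numbered from 1
                      encode s if !special, generate(n) label(mylabel)

                      // step 2: assign the special codes by hand ...
                      replace n = 999 if s == "Other (Specify)"
                      replace n = .a  if s == "Refused"
                      replace n = .b  if s == "Do not know"
                      replace n = .c  if s == "Missing"

                      // ... and add their labels to the label that encode created
                      label define mylabel 999 "Other (Specify)" .a "Refused" .b "Do not know" .c "Missing", add

                      list, nolabel
                      label list mylabel
                      ```

                      Because the special strings are excluded from encode, nobody has to look up which integer "Other (Specify)" happened to receive; it is set to 999 directly from the string variable.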




                      • #12
                        Hi Nick

                        Thanks for your response. This reply has been edited significantly after making a realization.

                        Edit: I looked again at the sorting on the questionnaire (the "Q") and realized that I made a mistake - "other" appears at 9 often but not always. The sorting on the Q is actually a little more complex: "other" seems to come at the end, but at times before some flavours of missing, as well as a select few different answers such as "none of the above". Coincidentally "other" landed at 9 so often I thought it was policy. So this issue I described of a number/position being reserved on the Q is probably less common than I was suggesting, my apologies.

                        That said, I think the simplest approach for me might still be to set it at 9 because this will match the Q in 90% of cases, the rest I can then adjust. So the extra control from Daniel is still welcome, and I can imagine other scenarios where it would be useful (including one where the Q did reserve a position as a policy).

                        Thanks again to both of you.

                        Regards,
                        Bruce
                        Last edited by Bruce McDougall; 16 Oct 2020, 09:05.



                        • #13
                          Hi Nick and Daniel

                          daniel klein I've been using your program and found it incredibly helpful. I think it would be great if it was added to the SSC.

                          Here are two practical examples of how I am using it:

                          First, a simple case where I encode a variable called q4, with values for missing, refused, do not know and other predefined, and all other necessary labels automatically created!

                          Code:
                          ren q4 temp
                          label define q4_label -1 "Refused" -2 "Do not know" -3 "Missing" 99 "Other"
                          encodelabel temp, generate(q4_numeric) label(q4_label)
                          drop temp
                          (By the way Dan it would be nifty if there was also an option for 'replace', so that I did not have to do the rename to temp trick. But that is not material in the bigger picture.)

                          Here is the result with numlabels:
                          [Image: encodelabel_example1.jpg]

                          My second example is more complicated, but really is the prize of this whole exercise. In this example, I identify two groups of variables, about 10 in each. All the variables of either group need to get the same set of value labels, with the predefined values, and if necessary said label must be expanded as new values are potentially found. In this example, I identify variables with "_hfias" or "_often" in the variable name using unab. I have set it up so that I can easily extend this to more groups, potentially doing the entire dataset in one fell swoop! (well, the string part). I can also then label save for use in another file, should it need the same labels.

                          Code:
                          local groupidentifiers "hfias often"
                          foreach groupidentifier of local groupidentifiers {
                              unab grouplist : *_`groupidentifier'*
                              
                              label define `groupidentifier'_lab -1 "Refused" -2 "Do not know" -3 "Missing" 99 "Other"
                              
                              di "`grouplist'"  // just to see it makes sense
                              foreach var of local grouplist {
                                  ren `var' temp
                                  encodelabel temp, label(`groupidentifier'_lab) generate(`var')
                                  drop temp
                              }
                          }
                          order $order
                          More generic examples might be instructive in showing how useful this can be. I used the above because I know they work. I tested the program and can report that it is working perfectly; I tested what happens when new values appear, and also when we go beyond 99. Both cases worked 100%, I'm very impressed.

                          Overall I think the program is useful for 1) allowing one to predefine values and 2) streamlining what would otherwise take more lines of code.

                          Nick, I gave some thought to your solution but must admit I don't understand it; I don't know how one would encode some values first and others later. The only way I can imagine doing this (without Dan's program) is that, once the variable is encoded, you would still have to manually find the numeric value assigned to "other" and then reassign it to 999. Apologies if I am missing something. I am curious about this because naturally if the alternatives are a little more work, encodelabel would be useful in the SSC. I would be pretty proud if my query led to that, which I didn't even consider at the start!

                          By the way, I am not sure if I need to tag you guys in the reply for you to receive a notification.

                          Regards,
                          Bruce
                          Last edited by Bruce McDougall; 29 Oct 2020, 10:00.



                          • #14
                            Originally posted by Bruce McDougall
                            By the way Dan it would be nifty if there was also an option for 'replace' [...]
                            I have implemented that; I have also sent the updated version to Kit Baum for upload to the SSC archives. encodelabel should be available from the SSC soon.


                            Originally posted by Bruce McDougall
                            My second example is more complicated, but really is the prize of this whole exercise. In this example, I identify two groups of variables, about 10 in each. All the variables of either group need to get the same set of value labels, with the predefined values, and if necessary said label must be expanded as new values are potentially found.
                            That would probably call for combining the functionality of encodelabel and multencode(SSC). The latter would replace the loop over variables.

                            More generally speaking, the different flavors of encode, such as encoder, sencode, and multencode (all SSC), might benefit from an integrated approach that combines some features of these commands to produce results that are otherwise hard/impossible to produce by simply calling these commands in a specific order.
                            Last edited by daniel klein; 30 Oct 2020, 17:49.



                            • #15
                              Hi Daniel

                              That is very cool, well done. I will make sure to give you credit when I use it.

                              Just as a heads up, I tried this morning to install it from the ssc and I received two different errors.

                              Code:
                              ssc install encodelabel
                              produced the message "encodelabel" not found at SSC, whereas trying
                              Code:
                              ssc install encode_label
                              produced "file http://fmwww.bc.edu/repec/bocode/e/encode_label.ado not found". I added the underscore after searching for encodelabel and finding that instead.

                              Regards,
                              Bruce
                              Last edited by Bruce McDougall; 02 Nov 2020, 01:48.
