Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Rename variables with labels when labels are multiple words?

    Background/big problem

    I have about 60 large data sets, each a different round of a repeated survey. I need to construct a single pooled data set containing only a handful of variables.

    Some variables (weights, identifiers, demographics) are in every data set and some (substantive questions) are only in some.

    The way the weights are stored in the data changed a couple of times, so I need to do slightly different manipulations for data sets from different periods.

    The substantive variables I care about have identical labels across the data sets, but completely different names. Other variables have 2-3 variations of both the name and the label.

    All the data sets live in the same folder, along with other data sets (additional rounds of the same survey) that don't have any of the substantive variables I need.

    I've started doing this the long way (open each file, do the necessary wrangling, save as a new file, with the intention of appending at the end). But there must be a way of doing it in a loop -- maybe I can leverage the nearly-consistent labels?

    Specific problem

    I am trying to rename my variables to their variable labels. My code is:

    Code:
    foreach var of varlist _all {
    local lab: variable lab `var'
    rename `var' `lab'
    }
    This gives an r198 error, along with the syntax for rename. I suspect the issue is that the labels are multiple words, but I'm not sure. How do I get around that? Something using subinstr to the labels before renaming?

  • #2
    The naming rules for labels and variable names are quite different. Almost anything goes for labels, but there are a number of restrictions for variable names. See

    Code:
    help naming_conventions
    For you specific problem, one way is to pass the label through -strtoname()-.

    Code:
    foreach var of varlist *{
        rename `var' `=strtoname("`:var lab `var''")'
    }
    Last edited by Andrew Musau; 11 Nov 2021, 04:25.

    Comment


    • #3
      Josephine, I'm afraid most tasks may not be done with loops as variable names and values may be arbitrarily inconsistent across surveys. I would do the following step by step.

      First, list all variables of all surveys that need to be kept from original datasets in, for example, an Excel file, including their original names and rules of value assignment (the latter is important, for example, "gender" is 1 and 2 in some survey but 0 and 1 in other surveys.), as well as the final uniform names and value assignment you'd like to set. Based on my own experience, such a file makes me very clear about the whole picture and where I'm heading for.

      Second, clean the data one by one. My first move of cleaning a data is always to keep substantive variables only. And then rename and revalue variables based on the uniform rules in the file from step one -- I don't think renaming variables to their variable labels is a good idea as variable labels may be different across surveys.

      Third, after the preliminary cleaning in step two (uniform names and value assignment now), append all 60 datasets with a couple of lines of code.

      Last but not least, clean the appended data more deeply, including setting new variable and value labels, correcting errors, etc.

      Of course, all steps above except the first need to be implemented with do files.

      Comment


      • #4
        Thanks Andrew. I had to add a "capture" but otherwise that worked perfectly!

        Comment


        • #5
          Originally posted by Fei Wang View Post
          Josephine, I'm afraid most tasks may not be done with loops as variable names and values may be arbitrarily inconsistent across surveys. I would do the following step by step.

          First, list all variables of all surveys that need to be kept from original datasets in, for example, an Excel file, including their original names and rules of value assignment (the latter is important, for example, "gender" is 1 and 2 in some survey but 0 and 1 in other surveys.), as well as the final uniform names and value assignment you'd like to set. Based on my own experience, such a file makes me very clear about the whole picture and where I'm heading for.

          Second, clean the data one by one. My first move of cleaning a data is always to keep substantive variables only. And then rename and revalue variables based on the uniform rules in the file from step one -- I don't think renaming variables to their variable labels is a good idea as variable labels may be different across surveys.

          Third, after the preliminary cleaning in step two (uniform names and value assignment now), append all 60 datasets with a couple of lines of code.

          Last but not least, clean the appended data more deeply, including setting new variable and value labels, correcting errors, etc.

          Of course, all steps above except the first need to be implemented with do files.
          Yeah, you are probably right. I feel like I should be able to avoid the deeply repetitive coding for cleaning each data set, but as you point out, there are potentially hidden hazards.

          Comment


          • #6
            I think Fei Wang gives very good advice.

            If you

            Originally posted by Josephine George View Post
            had to add a "capture"
            but did nothing with/conditional on the captured return code, there is a high risk of

            Originally posted by Josephine George View Post
            potentially hidden hazards
            that will mess things ups -- in the worst case without you even noticing that something went wrong.


            Comment


            • #7
              Originally posted by daniel klein View Post
              I think Fei Wang gives very good advice.

              If you



              but did nothing with/conditional on the captured return code, there is a high risk of



              that will mess things ups -- in the worst case without you even noticing that something went wrong.

              In this case, I'm fairly confident the issue that needed "capturing" is that some of the variable labels are long and use common stems. So I got variable already defined errors. Fortunately the variables of interest to me have labels that can fit within the variable name character limit comfortably.

              But, you and Fei Want are almost certainly correct. I think my plan is to go ahead and do all the cleaning/appending manually to create my working data.

              Separately though, I would still like to see if I can automate/avoid the repetitive coding, so will treat it as a bit of a side project and compare the resulting datasets/results, if I can ever get it working. The new issue is that the labels aren't quite as uniform as I thought. Once they've gone through strtoname they are in the fomat A345_menaingful_label. The "A345" bit is always one letter and 1-4 numbers for the substantive variables. I think it calls for regexr, but I'm not sure how to manipulate the variable name (rather than the value).

              Comment


              • #8
                Your description suggest the regex

                Code:
                ^[\w]{1}\d{3}\_(\w+)
                for example

                Code:
                 di ustrregexra("A345_something useful", "^[\w]{1}\d{3}\_(\w+)",  "$1")
                something useful
                Res.:

                Code:
                 di ustrregexra("A345_something useful", "^[\w]{1}\d{3}\_(\w+)",  "$1")
                something useful
                Therefore

                Code:
                foreach var of varlist *{
                cap lab var `var' `=ustrregexra("`:var lab `var''", "^[\w]{1}\d{3}\_(\w+)",  "$1")'
                }
                Last edited by Andrew Musau; 11 Nov 2021, 05:36.

                Comment


                • #9
                  Regular expressions are fine, but the last problem in #7 can be solved without them. The idea is just looking for whatever follows the first underscore.

                  Code:
                  . local given A345_meaningful_label
                  
                  
                  . di strtoname(substr("`given'", 1 + strpos("`given'", "_"), .))
                  meaningful_label

                  Comment


                  • #10
                    Originally posted by Josephine George View Post
                    I'm fairly confident the issue that needed "capturing" is that some of the variable labels are long and use common stems.
                    The problem is that capture does not care whether that is the problem or something else entirely. And, capture will hide all output so cannot know either. You want at least a minimum level of security from making false assumptions here. Something like

                    Code:
                    capture ...
                    if !inlist(_rc, 0, 110) error _rc // some error, other than "already defined"
                    might be a good start.

                    Comment


                    • #11
                      Thanks for all the input. I ended up doing all the wrangling one data set at a time and I'm glad I did as there were definitely things I would have missed trying to put it all in a loop.

                      Comment

                      Working...
                      X