Rename variables with labels when labels are multiple words?

Josephine George

Join Date: Dec 2018

Posts: 34
#1

Rename variables with labels when labels are multiple words?

11 Nov 2021, 03:48

Background/big problem

I have about 60 large data sets, each a different round of a repeated survey. I need to construct a single pooled data set containing only a handful of variables.

Some variables (weights, identifiers, demographics) are in every data set and some (substantive questions) are only in some.

The way the weights are stored in the data changed a couple of times, so I need to do slightly different manipulations for data sets from different periods.

The substantive variables I care about have identical labels across the data sets, but completely different names. Other variables have 2-3 variations of both the name and the label.

All the data sets live in the same folder, along with other data sets (additional rounds of the same survey) that don't have any of the substantive variables I need.

I've started doing this the long way (open each file, do the necessary wrangling, save as a new file, with the intention of appending at the end). But there must be a way of doing it in a loop -- maybe I can leverage the nearly-consistent labels?

Specific problem

I am trying to rename my variables to their variable labels. My code is:

Code:

foreach var of varlist _all { local lab: variable lab `var' rename `var' `lab' }

This gives an r198 error, along with the syntax for rename. I suspect the issue is that the labels are multiple words, but I'm not sure. How do I get around that? Something using subinstr to the labels before renaming?
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10281
#2

11 Nov 2021, 04:20

The naming rules for labels and variable names are quite different. Almost anything goes for labels, but there are a number of restrictions for variable names. See

Code:

help naming_conventions

For you specific problem, one way is to pass the label through -strtoname()-.

Code:

foreach var of varlist *{ rename `var' `=strtoname("`:var lab `var''")' }

Last edited by Andrew Musau; 11 Nov 2021, 04:25.
Comment
Fei Wang

Join Date: Oct 2021

Posts: 726
#3

11 Nov 2021, 04:38

Josephine, I'm afraid most tasks may not be done with loops as variable names and values may be arbitrarily inconsistent across surveys. I would do the following step by step.

First, list all variables of all surveys that need to be kept from original datasets in, for example, an Excel file, including their original names and rules of value assignment (the latter is important, for example, "gender" is 1 and 2 in some survey but 0 and 1 in other surveys.), as well as the final uniform names and value assignment you'd like to set. Based on my own experience, such a file makes me very clear about the whole picture and where I'm heading for.

Second, clean the data one by one. My first move of cleaning a data is always to keep substantive variables only. And then rename and revalue variables based on the uniform rules in the file from step one -- I don't think renaming variables to their variable labels is a good idea as variable labels may be different across surveys.

Third, after the preliminary cleaning in step two (uniform names and value assignment now), append all 60 datasets with a couple of lines of code.

Last but not least, clean the appended data more deeply, including setting new variable and value labels, correcting errors, etc.

Of course, all steps above except the first need to be implemented with do files.
2 likes
Comment
Josephine George

Join Date: Dec 2018

Posts: 34
#4

11 Nov 2021, 04:39

Thanks Andrew. I had to add a "capture" but otherwise that worked perfectly!
Comment
Josephine George

Join Date: Dec 2018

Posts: 34
#5

11 Nov 2021, 04:42

Originally posted by Fei Wang View Post

Josephine, I'm afraid most tasks may not be done with loops as variable names and values may be arbitrarily inconsistent across surveys. I would do the following step by step.

First, list all variables of all surveys that need to be kept from original datasets in, for example, an Excel file, including their original names and rules of value assignment (the latter is important, for example, "gender" is 1 and 2 in some survey but 0 and 1 in other surveys.), as well as the final uniform names and value assignment you'd like to set. Based on my own experience, such a file makes me very clear about the whole picture and where I'm heading for.

Second, clean the data one by one. My first move of cleaning a data is always to keep substantive variables only. And then rename and revalue variables based on the uniform rules in the file from step one -- I don't think renaming variables to their variable labels is a good idea as variable labels may be different across surveys.

Third, after the preliminary cleaning in step two (uniform names and value assignment now), append all 60 datasets with a couple of lines of code.

Last but not least, clean the appended data more deeply, including setting new variable and value labels, correcting errors, etc.

Of course, all steps above except the first need to be implemented with do files.

Yeah, you are probably right. I feel like I should be able to avoid the deeply repetitive coding for cleaning each data set, but as you point out, there are potentially hidden hazards.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3886
#6

11 Nov 2021, 04:52

I think Fei Wang gives very good advice.

If you

Originally posted by Josephine George View Post

had to add a "capture"

but did nothing with/conditional on the captured return code, there is a high risk of

Originally posted by Josephine George View Post

potentially hidden hazards

that will mess things ups -- in the worst case without you even noticing that something went wrong.
Comment
Josephine George

Join Date: Dec 2018

Posts: 34
#7

11 Nov 2021, 05:11

Originally posted by daniel klein View Post

I think Fei Wang gives very good advice.

If you

but did nothing with/conditional on the captured return code, there is a high risk of

that will mess things ups -- in the worst case without you even noticing that something went wrong.

In this case, I'm fairly confident the issue that needed "capturing" is that some of the variable labels are long and use common stems. So I got variable already defined errors. Fortunately the variables of interest to me have labels that can fit within the variable name character limit comfortably.

But, you and Fei Want are almost certainly correct. I think my plan is to go ahead and do all the cleaning/appending manually to create my working data.

Separately though, I would still like to see if I can automate/avoid the repetitive coding, so will treat it as a bit of a side project and compare the resulting datasets/results, if I can ever get it working. The new issue is that the labels aren't quite as uniform as I thought. Once they've gone through strtoname they are in the fomat A345_menaingful_label. The "A345" bit is always one letter and 1-4 numbers for the substantive variables. I think it calls for regexr, but I'm not sure how to manipulate the variable name (rather than the value).
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10281

11 Nov 2021, 05:32

Your description suggest the regex

Code:

^[\w]{1}\d{3}\_(\w+)

for example

Code:

 di ustrregexra("A345_something useful", "^[\w]{1}\d{3}\_(\w+)",  "$1")
something useful

Res.:

Code:

 di ustrregexra("A345_something useful", "^[\w]{1}\d{3}\_(\w+)",  "$1")
something useful

Therefore

Code:

foreach var of varlist *{
cap lab var `var' `=ustrregexra("`:var lab `var''", "^[\w]{1}\d{3}\_(\w+)",  "$1")'
}

Last edited by Andrew Musau; 11 Nov 2021, 05:36.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35779
#9

11 Nov 2021, 05:33

Regular expressions are fine, but the last problem in #7 can be solved without them. The idea is just looking for whatever follows the first underscore.

Code:

. local given A345_meaningful_label . di strtoname(substr("`given'", 1 + strpos("`given'", "_"), .)) meaningful_label
Comment
daniel klein

Join Date: Mar 2014

Posts: 3886
#10

11 Nov 2021, 06:37

Originally posted by Josephine George View Post

I'm fairly confident the issue that needed "capturing" is that some of the variable labels are long and use common stems.

The problem is that capture does not care whether that is the problem or something else entirely. And, capture will hide all output so cannot know either. You want at least a minimum level of security from making false assumptions here. Something like

Code:

capture ... if !inlist(_rc, 0, 110) error _rc // some error, other than "already defined"

might be a good start.
Comment
Josephine George

Join Date: Dec 2018

Posts: 34
#11

18 Nov 2021, 04:01

Thanks for all the input. I ended up doing all the wrangling one data set at a time and I'm glad I did as there were definitely things I would have missed trying to put it all in a loop.
Comment

Announcement

Rename variables with labels when labels are multiple words?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment