Appending two datasets but second data is changing labels (adopting wrong labels on its own)

Aditi EPRI

Join Date: Jul 2019

Posts: 24
#1

Appending two datasets but second data is changing labels (adopting wrong labels on its own)

16 Aug 2023, 10:54

Hello all
I am encountering a strange issue while appending two datasets. These do not have any unique identification variable in common.

When I append a DTA file into the main dataset through append using command, it adds the new dataset with same names but different values and labels. I am not sure why this is happening. I ran the appended dataset on its own and it seems fine but when appended to the main data the column changes values..

I have attached two snaps:-
1) Data of second dataset on its own (questions CQ7_1, CQ7_2 etc)
2) Data of second dataset when appended (questions CQ7_1, CQ7_2 etc)

The responses in the columns before and also have also changed/shifted.

Please help me fix this problem.. totally clueless

There are 307 var on main dataset and 251 var on second dataset with total obs 180. I use Stata/BE 17.0

Attached Files
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

16 Aug 2023, 11:22

The problem is that these data sets are not compatible in their current state and should not be put together. The two data sets have different value labels assigned to variables that have the same name. When you try to put them together, Stata ignores the revised label definitions in the -using- data set and applies the label definitions of the master data set throughout the combined result.

So you need to make these data sets compatible before putting them together. How you do this will depend on how these data sets were created in the first place. Frankly, looking at the screenshots you showed, it looks to me like the problem may be even more serious than just different labeling schemes. Many of the variables look like they have nothing to do with the variable of the same name in the other data set. Take c_q7_7 as an example: in the first screen shot it looks like a variable having something to do with health, whereas in the second it looks like a yes-no variable. What does c_q7_7 actually represent in each data set? (I singled out c_q7_7 as an example, but many of the variables exhibit a similar apparent unrelatedness.)

To figure this all out you will have to sit down with the data dictionaries of these data sets, determine whether the variables with the same names actually represent the same thing in both or not. If not, then you will need to change the variable names in one of the data sets or develop some kind of crosswalk showing which variable in data set 1 corresponds to which variable in data set 2. You will probably also need to remove the value labels from both data sets and develop a new, consistent set of value labels to apply to both. (Probably best to create that consistent set of value labels in the combined data set after you fix up everything else and -append- them.)

By the way, this problem arises often. I recommend instaslling Mark Chatfield's -precombine- command, available from SSC, and using it whenever you plan to combine two data sets. It will review the data sets for incompatibilities and alert you to them so you can stop and fix them first.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#3

16 Aug 2023, 11:37

one correction to Clyde Schechter 's otherwise excellent advice: -precombine- is not available from SSC; the easiest way to find (it's from SJ) is to use -search-
1 like
Comment
Aditi EPRI

Join Date: Jul 2019

Posts: 24
#4

16 Aug 2023, 13:29

Clyde Schechter it is true that two datasets have same variables but they have different names. For instance, main dataset has P_Q12_1 which is the same as C_Q7_1 of the appended (Second) dataset. All variables of C_Q7 are yes/no type (Similarly P_Q12 of the main dataset). For sure there is no variable of the same name across two datasets - they have all been prefixed also differently.

Maybe there is another solution to it?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

16 Aug 2023, 13:48

OK, so this is even worse than it appeared at first. There is no way for Stata to know that P_Q12_1 in one data set corresponds to C_Q7_1 in the other. You will have to rename the variables in one (or both) of the data sets to assure that the same variable has the same name in both data sets.

Then you have the problem of value labels. Do install and run -precombine- so that you can get an inventory of the conflicting variable labels. You will then have to eliminate those from both data sets and replace them with a consistent set of value labels.

I fear that actually this second step concerning value labels will be even more problematic. You need to review the actual variable coding in both data sets to be sure that the same numeric response (without the attached value label) to the same variable means the same thing in both data sets. If that is not the case, you will need to recode the variables so they are harmonized.

It sounds a lot like whoever originally created these two data sets did not have the possibility that somebody would later want to combine them in mind. They certainly made it very difficult to do so. Harmonizing these data sets is going to be a lot of time-consuming, tedious, error-prone work that will require meticulous attention to detail.
Comment
Aditi EPRI

Join Date: Jul 2019

Posts: 24
#6

17 Aug 2023, 05:06

Thanks Clyde Schechter for explaining the process.
Comment

Announcement

Appending two datasets but second data is changing labels (adopting wrong labels on its own)

Comment

Comment

Comment

Comment

Comment