Combining two datasets with different languages

Nicole Hameister

Join Date: Mar 2023

Posts: 4
#1

22 Mar 2023, 01:42

Hi everyone,

I have two datasets which are identical in terms of number of observations and variables as well as variable names. They only differ in their variable and value label language. Is there a smart way to combine them into one data set where I could switch between both label sets with the label language command? I want to avoid creating a whole new label language set manually, but I'm not sure that's possible.

Best, Nicole

Andrew Musau

Join Date: Oct 2014
Posts: 10188

22 Mar 2023, 02:19

I think you will have to clone the key variable in the using dataset to preserve the label. See

Code:

help clonevar

elabel from SSC by daniel klein may have a more straightforward solution that I am not aware of. He may chip in once he sees this thread.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2)
0 1
1 0
end
label values var1 english
label def english 0 "zero", modify
label def english 1 "one", modify

clonevar var1E=var1
tempfile dataset1
save `dataset1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float var1 str3 var3
0 "x"
1 "y"
end
label values var1 norsk
label def norsk 0 "null", modify
label def norsk 1 "en", modify


merge 1:1 var1 using `dataset1', nogen
lab list

Result:

Code:

. lab list
english:
           0 zero
           1 one
norsk:
           0 null
           1 en

Last edited by Andrew Musau; 22 Mar 2023, 02:23.


daniel klein

Join Date: Mar 2014
Posts: 3844

22 Mar 2023, 04:35

Originally posted by Nicole Hameister

I have two datasets which are identical in terms of number of observations and variables as well as variable names. They only differ in their variable and value label language.

The more relevant question is whether the datasets are identical in terms of values, too, and whether value labels (and variable labels) are consistent. Drawing on Andrew's example, if value 0 is mapped to "zero" in English, it should not be mapped to "en" in Norsk. Chatfield (2015) might be relevant. Also, see Golbe (2010).
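
As a minimal sketch of how the first point could be checked: official Stata's cf compares the values of the data in memory against a file on disk and ignores labels. The toy datasets and the tempfile name english are made up for illustration, and the sketch assumes both files hold the same number of observations in the same order.

```stata
// Toy "English" dataset (hypothetical stand-in for the real file)
clear
input float(var1 var2)
0 1
1 0
end
tempfile english
save `english'

// Toy "Norsk" dataset holding the same values
clear
input float(var1 var2)
0 1
1 0
end

// Compare every variable, observation by observation;
// -cf- looks at values only, not at variable or value labels
cf _all using `english', verbose

// r(Nsum) holds the number of differences found
display "differences found: " r(Nsum)
```

If the values differ, cf exits with an error, so a do-file stops at that point instead of combining mismatched data.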

If all you want to do is copy the label language definition from one dataset to another, elabel (SSC or GitHub) can help:

Code:

// Toy dataset with labels in Norsk
clear
input float (var1 var2)
0 1
1 0
end
label values var1 norsk
label def norsk 0 "null", modify
label def norsk 1 "en", modify

label variable var1 "Varibale en"
label variable var2 "Variable to"
label language nn , rename

// save the label language definition in a do-file
tempfile nn
elabel language nn , saving("`nn'")

// this is what the file looks like
type `nn'


// Toy dataset with labels in English
clear
input float(var1 var2)
0 1
1 0
end
label values var1 english
label def english 0 "zero", modify
label def english 1 "one", modify

label variable var1 "Varibale one"
label variable var2 "Variable two"
label language en , rename

// bring in the Norsk labels
do `nn'

// confirm desired result
label language
label list

Here is the result

Code:

(output omitted)
. type `nn'
label language nn, new
label data `""'
label variable var1 `"Varibale en"'
label values var1 norsk
label define norsk 0 `"null"', modify
label define norsk 1 `"en"', modify
label variable var2 `"Variable to"'
label values var2

(output omitted)
Language for variable and value labels

    Available languages:
            en
            nn

    Currently set is:               . label language nn

    To select different language:   . label language <name>

    To create new language:         . label language <name>, new
    To rename current language:     . label language <name>, rename

. label list
norsk:
           0 null
           1 en
english:
           0 zero
           1 one

I noticed that I never bothered documenting this feature; perhaps I thought of it as too esoteric to ever come up in real-life situations.

Chatfield, M. D. 2015. precombine: A command to examine n ≥ 2 datasets before combining. The Stata Journal, 15(3), pp. 607–626.
Golbe, D. L. 2010. Stata tip 83: Merging multilingual datasets. The Stata Journal, 10(1), pp. 152–156.

Last edited by daniel klein; 22 Mar 2023, 04:43. Reason: added references


Nicole Hameister

Join Date: Mar 2023

Posts: 4
#4

22 Mar 2023, 09:45

Thank you Andrew and Daniel! I've checked both your suggestions, but neither seems to solve my issue.

My two datasets (English and German) are completely identical, in terms of values, number of observations etc. Also, variable and value labels are identical - in the German data set, the variable sex has the German var label "Geschlecht" and the value label is "sex", while in English the var label is "Gender" and the value label is "sex" as well. This seems to be exactly the problem, that value labels have the exact same name in both datasets: I applied Daniel's complete code to my datasets and one language set overwrote the other - here is the result:

. label language

Language for variable and value labels

    Available languages:
            de
            en

    Currently set is:               . label language de

    To select different language:   . label language <name>

    To create new language:         . label language <name>, new
    To rename current language:     . label language <name>, rename

. label list
sex:
           1 1. männlich
           2 2. weiblich

.
end of do-file
So how do I combine two language sets with identical value label names for a dataset of >400 variables without redefining the value labels manually for each variable, so that one language doesn't overwrite the labels from the other dataset but I end up with two language sets?

I'm not sure I'm making myself very clear here, so I apologise in advance for any confusion I might cause.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2400
#5

22 Mar 2023, 09:59

I don't know of an obvious way to broadly combine language sets for metadata such as variable names and labels. One could think of storing this information in the data or variable -char-acteristics, but this information would not be readily used or useful to Stata's data management commands.

In this specific case, where the data, variables, labels and concepts map perfectly between languages, you might consider holding all of your metadata in an Excel sheet (for example) and defining a custom program that reads in the metadata for a given language from this sheet and handles all of the variable renaming and the variable and value labeling based on its contents. You would end up with an English or a German dataset (or both, if you desire).
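
A rough sketch of that idea for variable labels only, under loud assumptions: the metadata table and its column names (varname, label_en, label_de) are hypothetical, and it is typed in with input here so the example runs on its own. A real sheet would presumably be read with import excel instead.

```stata
// Hypothetical metadata table: one row per variable, one column per language
clear
input str8 varname str20 label_en str20 label_de
"sex" "Gender" "Geschlecht"
"age" "Age in years" "Alter in Jahren"
end

// Copy the labels for the chosen language into local macros
local lang de
local n = _N
forvalues i = 1/`n' {
    local name_`i'  = varname[`i']
    local label_`i' = label_`lang'[`i']
}

// Toy data standing in for the real dataset
clear
input byte sex byte age
1 34
2 51
end

// Apply the variable labels read from the metadata table
forvalues i = 1/`n' {
    label variable `name_`i'' `"`label_`i''"'
}
describe
```

Setting the local lang to en instead would apply the English column; value labels would need a second table and a similar loop.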
Nicole Hameister

Join Date: Mar 2023

Posts: 4
#6

23 Mar 2023, 02:14

Originally posted by Leonardo Guizzetti

I don't know of an obvious way to broadly combine language sets for metadata such as variable names and labels. One could think of storing this information in the data or variable -char-acteristics, but this information would not be readily used or useful to Stata's data management commands.

In this specific case, where the data, variables, labels and concepts map perfectly between languages, you might consider holding all of your metadata in an Excel sheet (for example) and defining a custom program that reads in the metadata for a given language from this sheet and handles all of the variable renaming and the variable and value labeling based on its contents. You would end up with an English or a German dataset (or both, if you desire).

Thank you Leonardo, this is what I suspected: that I'll have to do it in an inelegant way.
daniel klein

Join Date: Mar 2014

Posts: 3844
#7

23 Mar 2023, 04:06

Originally posted by Nicole Hameister

My two datasets (English and German) are completely identical, in terms of values, number of observations etc. Also, variable and value labels are identical

It seems you want to say that the value label names are identical, right? Neither value label contents, i.e., labels, nor variable labels can be identical (except for the occasional cases where German and English actually use the same words). By the way, many social scientists, even non-radical-left-wingers, would argue that sex and gender are not the same, but this discussion does not seem relevant here. However, you really want to be sure about the precise setup of your data. By that I mean your assertion of what is and what is not identical should be based on more than a quick glance.

Originally posted by Nicole Hameister

I'm not sure I'm making myself very clear here

A data example might be better than a verbal description. In this case, metadata might suffice, meaning that you may restrict the datasets to one observation and recode all values to missing. Anyway, assuming that the value label names are indeed identical, the following code should help with the value labels

Code:

// Load the English dataset
use /* <en>.dta */

// Add a suffix to all value label names
elabel rename * =_en

// Save the value label definitions
tempfile labels_en
label save using "`labels_en'"

// Now load the German dataset
use /* <de>.dta */

// Add a suffix to all value label names
elabel rename * =_de

// Save the entire German label language
tempfile language_de
elabel language , saving("`language_de'")

// Drop all German value label definitions(!)
elabel drop *_de

// Rename the still attached German value label names(!)
elabel rename *_de *_en , nomemory

// Now define the English labels
do "`labels_en'"

// Rename the label language
label language en , rename

// Bring back the German label language
do "`language_de'"

The variable labels are a bit more complicated because variable names seem to differ. If variables are in the same order in both datasets, then you might get away with something along these lines:

Code:

// Load the English data, again
use /* <en>.dta */

// Collect the variable labels
local k 0
foreach var of varlist _all {
    local ++k
    local variable_label_`k' : variable label `var'
}

// Now load the multilingual, former German, dataset
use /* <de>.dta */

// Switch to English label language
label language en

// Attach the variable labels
local k 0
foreach var of varlist _all {
    local ++k
    label variable `var' `"`variable_label_`k''"'
}
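
For the metadata-only data example mentioned above (one observation, all values set to missing), a minimal sketch with official commands might look like this. The toy variables sex and name and their labels are invented for illustration; with real data, only the part from keep onward would be needed.

```stata
// Toy data standing in for the real file
clear
input byte sex str5 name
1 "Anna"
2 "Ben"
end
label define sex 1 "1. male" 2 "2. female"
label values sex sex
label variable sex "Gender"

// Keep a single observation and blank out every value
keep in 1
foreach var of varlist _all {
    capture confirm numeric variable `var'
    if _rc == 0 {
        replace `var' = .      // numeric variable: set to missing
    }
    else {
        replace `var' = ""     // string variable: set to empty
    }
}

// Variable labels and value label definitions are still there
describe
label list
```

The stripped dataset carries no real values but keeps the variable names, variable labels and value labels, which is what matters for this question.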
daniel klein

Join Date: Mar 2014

Posts: 3844
#8

23 Mar 2023, 04:19

Originally posted by Nicole Hameister

I applied Daniel's complete code to my datasets and one language set overwrote the other - here is the result:

Sorry, I must have missed this one. Are you saying that you do not get an error from my code in #3? In that case, variable names would also be identical. If so, the whole thing boils down to:

Code:

// Load the English data
use /* <en>.dta */

// Add a suffix to the value label names
elabel rename * *_en

// Now save the label language definition in a do-file
tempfile language_en
elabel language en , saving("`language_en'")

// Load the German data
use /* <de>.dta */

// Optionally, add a suffix to the value label names
elabel rename * *_de

// Add the English label language
do "`language_en'"
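
Why the suffix matters: each label language stores its own variable labels and its own value-label attachments, but the value label definitions themselves are shared across languages. Two languages that both use a value label named sex therefore end up with one set of texts. A minimal sketch with official commands only, using invented toy data and the label names sex_en and sex_de:

```stata
// Toy data with English labels
clear
input byte sex
1
2
end
label define sex_en 1 "1. male" 2 "2. female"
label values sex sex_en
label variable sex "Gender"
label language en, rename

// Second language: needs its own, differently named value label
label language de, new
label define sex_de 1 "1. männlich" 2 "2. weiblich"
label values sex sex_de
label variable sex "Geschlecht"

// Switching languages swaps variable labels and attached value labels
label language en
describe
label language de
describe

// Both definitions coexist because their names differ
label list
```

Had both definitions been called sex, the second label define would have had to modify the first, which is the overwriting seen in #4.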

Last edited by daniel klein; 23 Mar 2023, 04:21.
Nicole Hameister

Join Date: Mar 2023

Posts: 4
#9

23 Mar 2023, 11:08

Originally posted by daniel klein

Sorry, I must have missed this one. Are you saying that you do not get an error from my code in #3? In that case, variable names would also be identical. If so, the whole thing boils down to:

Code:

// Load the English data
use /* <en>.dta */

// Add a suffix to the value label names
elabel rename * *_en

// Now save the label language definition in a do-file
tempfile language_en
elabel language en , saving("`language_en'")

// Load the German data
use /* <de>.dta */

// Optionally, add a suffix to the value label names
elabel rename * *_de

// Add the English label language
do "`language_en'"

Wow, thank you Daniel, this actually has done the trick! I have both language sets now conveniently stored in one dataset. This is really helpful for cases like mine where you receive two SPSS datasets in different languages and want to combine them into one Stata data set. Great!!

(Btw, yes, I'll rename sex to gender.)