  • Combining two datasets with different languages

    Hi everyone,

    I have two datasets which are identical in terms of number of observations and variables as well as variable names. They only differ in their variable and value label language. Is there a smart way to combine them into one data set where I could switch between both label sets with the label language command? I want to avoid creating a whole new label language set manually, but I'm not sure that's possible.

    Best, Nicole

  • #2
    I think you will have to clone the key variable in the using dataset to preserve the label. See

    Code:
    help clonevar
    elabel from SSC by daniel klein may have a more straightforward solution that I am not aware of. He may chip in once he sees this thread.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(var1 var2)
    0 1
    1 0
    end
    label values var1 english
    label def english 0 "zero", modify
    label def english 1 "one", modify
    
    clonevar var1E=var1
    tempfile dataset1
    save `dataset1'
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float var1 str3 var3
    0 "x"
    1 "y"
    end
    label values var1 norsk
    label def norsk 0 "null", modify
    label def norsk 1 "en", modify
    
    
    merge 1:1 var1 using `dataset1', nogen
    lab list
    Result:

    Code:
    . lab list
    english:
               0 zero
               1 one
    norsk:
               0 null
               1 en
    Last edited by Andrew Musau; 22 Mar 2023, 02:23.



    • #3
      Originally posted by Nicole Hameister
      I have two datasets which are identical in terms of number of observations and variables as well as variable names. They only differ in their variable and value label language.
      The more relevant question is whether the datasets are identical in terms of values, too, and whether value labels (and variable labels) are consistent. Drawing on Andrew's example, if value 0 is mapped to "zero" in English, it should not be mapped to "en" in Norsk. Chatfield (2015) might be relevant. Also, see Golbe (2010).

      If all you want to do is copy the label language definition from one dataset to another, elabel (SSC or GitHub) can help:

      Code:
      // Toy dataset with labels in Norsk
      clear
      input float (var1 var2)
      0 1
      1 0
      end
      label values var1 norsk
      label def norsk 0 "null", modify
      label def norsk 1 "en", modify
      
      label variable var1 "Varibale en"
      label variable var2 "Variable to"
      label language nn , rename
      
      // save the label language definition in a do-file
      tempfile nn
      elabel language nn , saving("`nn'")
      
      // this is what the file looks like
      type `nn'
      
      
      // Toy dataset with labels in English
      clear
      input float(var1 var2)
      0 1
      1 0
      end
      label values var1 english
      label def english 0 "zero", modify
      label def english 1 "one", modify
      
      label variable var1 "Varibale one"
      label variable var2 "Variable two"
      label language en , rename
      
      // bring in the Norsk labels
      do `nn'
      
      // confirm desired result
      label language
      label list
      Here is the result

      Code:
      (output omitted)
      . type `nn'
      label language nn, new
      label data `""'
      label variable var1 `"Varibale en"'
      label values var1 norsk
      label define norsk 0 `"null"', modify
      label define norsk 1 `"en"', modify
      label variable var2 `"Variable to"'
      label values var2
      
      (output omitted)
      Language for variable and value labels
      
          Available languages:
                  en
                  nn
      
          Currently set is:               . label language nn
      
          To select different language:   . label language <name>
      
          To create new language:         . label language <name>, new
          To rename current language:     . label language <name>, rename
      
      . label list
      norsk:
                 0 null
                 1 en
      english:
                 0 zero
                 1 one

      I noticed that I never bothered documenting this feature; perhaps I thought of it as too esoteric to ever come up in real-life situations.


      Chatfield, M. D. 2015. precombine: A command to examine n ≥ 2 datasets before combining. The Stata Journal, 15(3), pp. 607–626.
      Golbe, D. L. 2010. Stata tip 83: Merging multilingual datasets. The Stata Journal, 10(1), pp. 152–156.
      Last edited by daniel klein; 22 Mar 2023, 04:43. Reason: added references



      • #4
        Thank you Andrew and Daniel! I've checked both your suggestions, but neither seems to solve my issue.

        My two datasets (English and German) are completely identical, in terms of values, number of observations etc. Also, variable and value labels are identical - in the German data set, the variable sex has the German var label "Geschlecht" and the value label is "sex", while in English the var label is "Gender" and the value label is "sex" as well. This seems to be exactly the problem, that value labels have the exact same name in both datasets: I applied Daniel's complete code to my datasets and one language set overwrote the other - here is the result:

        . label language

        Language for variable and value labels

            Available languages:
                    de
                    en

            Currently set is:               . label language de

            To select different language:   . label language <name>

            To create new language:         . label language <name>, new
            To rename current language:     . label language <name>, rename

        . label list
        sex:
                   1 1. männlich
                   2 2. weiblich

        .
        end of do-file
        So how do I combine two language sets with identical value label names for a dataset of >400 variables without redefining the value labels manually for each variable, so that one language doesn't overwrite the labels from the other dataset but I end up with two language sets?

        I'm not sure I'm making myself very clear here, so I apologise in advance for any confusion I might cause.



        • #5
          I don't know of an obvious way to broadly combine language sets for metadata such as variable names and labels. One could think of storing this information in the data or variable -char-acteristics, but this information would not be readily used or useful to Stata's data management commands.

          In this specific case, where the data, variables, labels and concepts map perfectly between languages, you might consider holding all of your metadata in an Excel sheet (for example) and defining a custom program that reads in the metadata for a given language from this sheet and handles all of the variable renaming and the variable and value labeling based on its contents. You would end up with an English or a German dataset (or both, if you desire).
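
          A minimal sketch of such a program, under stated assumptions: the workbook layout (a header row with the columns varname, label_en and label_de), the program name apply_labels and its options language() and file() are all hypothetical. It handles variable labels only; value labels would need an analogous sheet and loop, and labels containing double quotes may need extra care.

          ```stata
          * Hypothetical helper: read variable labels for one language from Excel.
          * Assumed sheet layout: header row with varname, label_en, label_de.
          capture program drop apply_labels
          program define apply_labels
              version 14
              syntax , LANGuage(name) FILE(string)

              // Read the metadata while the working dataset is set aside
              preserve
              import excel using `"`file'"', firstrow allstring clear
              local n = _N
              forvalues i = 1/`n' {
                  local var_`i' = varname[`i']
                  local lab_`i' = label_`language'[`i']
              }
              restore

              // Label every listed variable that exists in the working dataset
              forvalues i = 1/`n' {
                  capture confirm variable `var_`i'', exact
                  if !_rc {
                      label variable `var_`i'' `"`lab_`i''"'
                  }
              }
          end
          ```

          With the working dataset in memory, a call such as apply_labels, language(de) file("metadata.xlsx") would then attach the German variable labels; metadata.xlsx is a placeholder file name.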



          • #6
            Thank you Leonardo, this is what I suspected: that I'll have to do it in an inelegant way.



            • #7
              Originally posted by Nicole Hameister
              My two datasets (English and German) are completely identical, in terms of values, number of observations etc. Also, variable and value labels are identical
              It seems you want to say that the value label names are identical, right? Neither value label contents, i.e., labels, nor variable labels can be identical (except for the occasional cases where German and English actually use the same words). By the way, many social scientists, even non-radical-left-wingers, would argue that sex and gender are not the same, but this discussion does not seem relevant here. However, you really want to be sure about the precise setup of your data. By that I mean your assertion of what is and what is not identical should be based on more than a quick glance.

              Originally posted by Nicole Hameister
              I'm not sure I'm making myself very clear here
              A data example might be better than a verbal description. In this case, metadata might suffice, meaning that you may restrict the datasets to one observation and recode all values to missing. Anyway, assuming that the value label names are indeed identical, the following code should help with the value labels

              Code:
              // Load the English dataset
              use /* <en>.dta */
              
              // Add a suffix to all value label names
              elabel rename * =_en
              
              // Save the value label definitions
              tempfile labels_en
              label save using "`labels_en'"
              
              // Now load the German dataset
              use /* <de>.dta */
              
              // Add a suffix to all value label names
              elabel rename * =_de
              
              // Save the entire German label language
              tempfile language_de
              elabel language , saving("`language_de'")
              
              // Drop all German value label definitions(!)
              elabel drop *_de
              
              // Rename the still attached German value label names(!)
              elabel rename *_de *_en , nomemory
              
              // Now define the English labels
              do "`labels_en'"
              
              // Rename the label language
              label language en , rename
              
              // Bring back the German label language
              do "`language_de'"

              The variable labels are a bit more complicated because variable names seem to differ. If the variables are in the same order in both datasets, then you might get away with something along these lines:

              Code:
              // Load the English data, again
              use /* <en>.dta */
              
              // Collect the variable labels
              local k 0
              foreach var of varlist _all {
                  local ++k
                  local variable_label_`k' : variable label `var'
              }
              
              // Now load the multilingual, former German, dataset
              use /* <de>.dta */
              
              // Switch to English label language
              label language en
              
              // Attach the variable labels
              local k 0
              foreach var of varlist _all {
                  local ++k
                  label variable `var' `"`variable_label_`k''"'
              }



              • #8
                Originally posted by Nicole Hameister
                I applied Daniel's complete code to my datasets and one language set overwrote the other - here is the result:
                Sorry, I must have missed this one. Are you saying that you do not get an error from my code in #3? In that case, variable names would also be identical. If so, the whole thing boils down to:

                Code:
                // Load the English data
                use /* <en>.dta */
                
                // Add a suffix to the value label names
                elabel rename * *_en
                
                // Now save the label language definition in a do-file
                tempfile language_en
                elabel language en , saving("`language_en'")
                
                // Load the German data
                use /* <de>.dta */
                
                // Optionally, add a suffix to the value label names
                elabel rename * *_de
                
                // Add the English label language
                do "`language_en'"
                Last edited by daniel klein; 23 Mar 2023, 04:21.
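
                Why the suffix matters can be seen in a toy example that uses only built-in commands. This is a sketch with made-up data, not the elabel approach above: label language keeps variable labels and the attachment of value label names separately for each language, but the value label definitions themselves are shared by all languages, so two languages can only show different value labels if they attach differently named definitions. The variable sex and the label names sex_en and sex_de are illustrative assumptions.

                ```stata
                // Toy English dataset whose value label is called "sex"
                clear
                input byte sex
                1
                2
                end
                label define sex 1 "male" 2 "female"
                label values sex sex
                label variable sex "Gender"
                label language en, rename

                // Give the English definition a language-specific name
                label copy sex sex_en
                label values sex sex_en
                label drop sex

                // Create an empty German language and attach German labels
                label language de, new
                label define sex_de 1 "männlich" 2 "weiblich"
                label values sex sex_de
                label variable sex "Geschlecht"

                // Switch back and forth; each language shows its own labels
                label language en
                describe sex
                label language de
                describe sex
                ```

                Had both languages attached a definition named sex, the second label define with the modify option would have replaced the texts for both languages at once, which matches the overwriting reported in #4.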



                • #9
                  Wow, thank you Daniel, this actually has done the trick! I have both language sets now conveniently stored in one dataset. This is really helpful for cases like mine where you receive two SPSS datasets in different languages and want to combine them into one Stata data set. Great!!

                  (Btw, yes, I'll rename sex to gender.)
