My appended dataset displays labels from the first dataset

Julia Simon

Join Date: Apr 2022

Posts: 37
#1

21 Jul 2022, 03:47

Dear Statalisters,

All my .dta files (one per country) are stored in the same folder. I am using these commands to append them all:

Code:

cd "$input"
local append: dir "." files "*.dta"
*log using session1
*precombine `append' // precombine is from SSC
*log close
append using `append', force

But once I have my appended dataset, the variable that is supposed to denote the region of the surveyed individual gets its labels from the first dataset, i.e. the first country. Please have a look at my data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int a1 float a2
12 2
12 2
12 1
12 1
12 1
12 1
12 2
12 1
12 1
12 1
12 2
12 1
12 2
12 1
12 1
12 1
12 2
12 2
12 1
12 1
end
label values a1 A1
label values a2 a2
label def a2 1 "Tirana", modify
label def a2 2 "Durres and Shkoder", modify

As you can see, a2 has labels for regions that are located in Albania, but 12 is the country code for a country that isn't Albania. I suspect this is due to the option "force", but if I leave it out, I get an error message saying that my variables do not have the same format across datasets. Do I have to manually format every single variable? What should I do to get an appended dataset that displays the appropriate labels?
Fei Wang

Join Date: Oct 2021

Posts: 726
#2

21 Jul 2022, 04:12

Julia, it would be helpful to show two data examples (one from each of two countries) before appending, so that we can understand the source of the problem.

Julia Simon

Join Date: Apr 2022
Posts: 37

21 Jul 2022, 04:44

Fei: You are right. Here are two data examples from Albania and Azerbaijan:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(a1 a2)
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 2
44 4
44 4
44 4
44 4
44 4
44 4
44 4
44 4
44 4
44 4
end
label values a1 A1
label def A1 44 "Albania", modify
label values a2 a2
label def a2 2 "Durres and Shkoder", modify
label def a2 4 "Elbasan and Korce", modify

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(a1 a2)
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
65 1
end
label values a1 A1
label def A1 65 "Azerbaijan", modify
label values a2 a2
label def a2 1 "Baku & Apsheronski", modify

In the appended dataset, here are the labels found for Azerbaijan:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int a1 float a2
65 2
65 1
65 3
65 2
65 1
65 3
65 1
65 4
65 3
65 4
65 1
65 2
65 3
65 2
65 3
65 2
65 3
65 1
65 3
65 3
end
label values a1 A1
label def A1 65 "Azerbaijan", modify
label values a2 a2
label def a2 1 "Tirana", modify
label def a2 2 "Durres and Shkoder", modify
label def a2 3 "Fier and Vlore", modify
label def a2 4 "Elbasan and Korce", modify

As you can see, these are the labels found in the Albanian dataset (which is the first one in the list of files).

Fei Wang

Join Date: Oct 2021

Posts: 726
#4

21 Jul 2022, 05:05

Julia, as I understand it, a value of "a2" does not uniquely identify a district. For example, every country's data may contain 1's, but they refer to different districts. In that sense, you cannot append the datasets while "a2" is numeric. My suggestion would be to convert "a2" in every country's dataset into a string variable before appending, for example:

Code:

decode a2, generate(a2_str)
drop a2
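
To see the collision concretely, here is a minimal sketch on toy data. The two tiny datasets and the tempfile name `aze` are made up for illustration; only the label texts are taken from the examples above. When both datasets carry a value label with the same name, append keeps the definition already in memory, so Azerbaijan's code 1 ends up displayed with Albania's text.

```stata
* Toy data only: a hypothetical Azerbaijan file held in a tempfile
clear
input byte(a1 a2)
65 1
end
label define a2 1 "Baku & Apsheronski"
label values a2 a2
tempfile aze
save "`aze'"

* Toy Albania data, playing the role of the first file in the list
clear
input byte(a1 a2)
44 1
44 2
end
label define a2 1 "Tirana" 2 "Durres and Shkoder"
label values a2 a2

* The label a2 already in memory (Albania) is kept, so the
* Azerbaijan row with a2 == 1 is now shown as "Tirana"
append using "`aze'"
list a1 a2
label list a2
```

Running decode and drop in each dataset before the append avoids this, because the region names then travel as text rather than as codes that depend on a label.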
daniel klein

Join Date: Mar 2014

Posts: 3911
#5

21 Jul 2022, 05:21

Fei Wang gives good advice. Chances are, the numeric values in each dataset were created using encode. Because encode always assigns values 1, 2, ... to the sorted list of strings (areas in this case), the numeric values depend on which areas were observed in each dataset. The best way to combine the datasets is to decode the variables in all datasets, combine the datasets, and then let encode create a value label for the combined variable in the final dataset.
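
The decode, combine, encode sequence might look roughly like the sketch below. It is not a tested recipe for the actual files: the two toy datasets, the tempfile names `alb`, `aze` and `combined`, and the new variable name `region` are all assumptions made for illustration, and it presumes that a2 is a labelled numeric variable in every file.

```stata
* Toy stand-ins for two country files (hypothetical data and names)
clear
input byte(a1 a2)
44 2
44 4
end
label define a2 2 "Durres and Shkoder" 4 "Elbasan and Korce"
label values a2 a2
tempfile alb
save "`alb'"

clear
input byte(a1 a2)
65 1
end
label define a2 1 "Baku & Apsheronski"
label values a2 a2
tempfile aze
save "`aze'"

* With real files the list could instead come from:
*   local files : dir "." files "*.dta"
local files `""`alb'" "`aze'""'

* Steps 1 and 2: decode a2 in each file, then append the text versions
tempfile combined
local first = 1
foreach f of local files {
    use "`f'", clear
    decode a2, generate(a2_str)   // region names as strings
    drop a2
    if !`first' {
        append using "`combined'"
    }
    save "`combined'", replace
    local first = 0
}

* Step 3: one numeric variable with a single value label for all countries
encode a2_str, generate(region)
label list region
```

The result is encoded into a new variable name (`region`) rather than back into a2 because a value label called a2 may still be stored in the data, and encode adds to an existing label of the same name instead of starting fresh. If two countries happen to share a region name, they would receive the same code, so combining the country and region strings before encode may be worth considering.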