append cross section data sets when id name coding are inconsistent

Zhang_Lu

Join Date: Oct 2014

Posts: 155
#1

append cross section data sets when id name coding are inconsistent

10 Nov 2014, 02:14

Hi, I have several cross-section data which id(here named region) and year variable, I want to append them togather to form a panel. When trying to achieve this goal with --append-- command, I encouter series problem that some observations were lost for certain year(s)t, but they are not really missing observation as I can see them in the raw data. So I check the cross-section data sets seperately and find in the these certain year the value of int type variable region is different from other years (I use --recode-- code command to convert "region" from string to int) .For example,the value of region name "哈尔滨"in 5 of 6 data sets are 65 but in year2012 it's 63 so there is a inconsistency. I don't know what cause the inconsistent, the variable type is uniform for all region names and I didn't give them any (value or variable) label before appending. Using the -xtdes- command, I found about 80% of my observations are balanced with the missing of id names concentrated at two years,2012and2006,see below

Freq. Percent Cum. Pattern
---------------------------+----------
262 80.62 80.62 11111111
19 5.85 86.46 1111111.
10 3.08 89.54 1.111111
6 1.85 91.38 .1111111
5 1.54 92.92 ..111111
3 0.92 93.85 11111.11
2 0.62 94.46 ..1.111.
2 0.62 95.08 11.1...1
1 0.31 95.38 ......1.
15 4.62 100.00 (other patterns)
---------------------------+----------
325 100.00 XXXXXXXX

I highly suspect that this problem is due to some kinds of differences of id names (region in this case) between different cross-section data sets. In principle, the --encode-- command attach unique code to a id name, regardless its relative position. But even I tried to copy the id names of master datasets to substitute the corresponding ones in certain years, the problem remained.That really confuse me. Is that possible that my id names are in chinese characters (b/c they are chinese data) cause this problems ? I don't know. , but it worked just fine in the most cases and I have too many of them so giving id a english names is extremely cubersome
I have been stucked by this problem for 2 days and still have no hope, I feel it may due to some stupid reasons that I cannot find. The worst thing is that I have no one to seek for help nearby. So specialists on statalist,please check it and give me some clue, Thank you in advance. Here I pose the raw data, cross-section data and my appended panel data
Attached Files

nut_growth_2011.dta (10.1 KB, 2 views)

nut_growth_2012.dta (9.7 KB, 2 views)

nut_growth_2009.dta (9.9 KB, 1 view)

nut_growth_2010.dta (9.9 KB, 1 view)

nut_growth_2008.dta (9.9 KB, 1 view)

2012_人口.xls (91.5 KB, 2 views)

2011_人口.xls (90.0 KB, 1 view)

nut_growth_2005_2012.dta (55.8 KB, 2 views)

nut_growth_2006.dta (10.5 KB, 1 view)

nut_growth_2007.dta (10.0 KB, 1 view)

nut_growth_2005.dta (10.5 KB, 1 view)

2006_人口.xls (96.0 KB, 1 view)

2007_人口.xls (91.5 KB, 1 view)

2008_人口.xls (143.5 KB, 1 view)

2010_人口.xls (90.5 KB, 1 view)

2005_人口.xls (95.5 KB, 1 view)

2009_人口.xls (92.0 KB, 1 view)

Last edited by Zhang_Lu; 10 Nov 2014, 02:16.
Tags: None
Zhang_Lu

Join Date: Oct 2014

Posts: 155
#2

10 Nov 2014, 02:18

Here is the corss-section data of two years,the appended panel and two year's raw data
Attached Files

nut_growth_2011.dta (10.1 KB, 2 views)

nut_growth_2012.dta (9.7 KB, 2 views)

nut_growth_2005_2012.dta (55.8 KB, 2 views)

2012_人口.xls (91.5 KB, 2 views)

2012_卫生.xls (133.0 KB, 1 view)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#3

10 Nov 2014, 07:56

I'm guessing that you built your combined data set by going through the following steps (in pseudo-code):

Code:

for each year 2005 through 2012 { import excel file for that year into Stata encode* the region variable to make it numeric save stata data file for that year } append all the Stata data file together save the result as nut_growth_2005_2012 // *It's -encode-, not -recode-. -recode- changes numeric values but does not convert strings to numbers

The mistake is encoding the region variable separately in each year's file. When -encode- works, it assigns numbers 1, 2, 3, ... to the various values of the string variable being -encode-d in alphabetical order. If the list of region names is not identical in all of the individual files, this will result in the kind of inconsistent assignment of numeric values to names you describe: when working on one file, Stata has no knowledge of what it did in some other file that is no longer in memory. So you need to re-do the building of the cumulative file along these lines:

Code:

// SET UP A TEMPORARY FILE TO SERIALLY APPEND ONE-YEAR FILES clear tempfile building save `building', emptyok // LOOP OVER YEARS, IMPORTING FROM EXCEL // AND APPEND THE FILES INTO THE TEMPORARY FILE // AS WE GO, KEEPING REGION AS A STRING VARIABLE forvalues y = 2005/2012 { import excel `y'_nut_growth.xls, firstrow case(lower) clear // USE ACTUAL .xls FILENAME gen int year = `y' // IF YEAR IS NOT ALREADY IN THE EXCEL FILE append using `building' save "`building'", replace } // NOW ENCODE REGION TO CREATE A NUMERIC // PANEL IDENTIFIER CONSISTENTLY ACROSS YEARS use `building', clear encode region, gen(_region) drop region rename _region region xtset region year save nut_growth_2005_2012, replace

A couple of notes: I did not look at your Excel files, so I assumed that they needed year to be added as a variable and that nothing else needed to be done before putting them into a Stata data set. If the reality is different you will have to modify the code within that -forvalues- loop accordingly. I also did not use the actual names of your .xls files in the -import excel- command because I do not know how to create those Chinese characters on my computer. You will need to use the actual name of your .xls files in this code.
1 like
Comment

Announcement

append cross section data sets when id name coding are inconsistent

Comment

Comment