Merging using data with repeated cross section master data by changing long to wide format of variables

Tariq Abdullah

Join Date: Apr 2021
Posts: 366

Merging using data with repeated cross section master data by changing long to wide format of variables

29 Jul 2022, 03:50

I need to merge master data (which is repeated cross section data - observations of couple of variables for each county at 4 different quarter for a particular year) with repeated cross section of my using data. The format for my using data is given here :

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int year float county byte(race hispanic gender) float(age pop)
1990 1001 1 0 1  5  882
1990 1001 1 0 1  6 1059
1990 1001 1 0 1  7 1148
1990 1001 1 0 1  8 1059
1990 1001 1 0 1  9 1051
1990 1001 1 0 1 10  876
1990 1001 1 0 1 11  792
1990 1001 1 0 1 12  670
1990 1001 1 0 1 13  519
1990 1001 1 0 2  5  824
1990 1001 1 0 2  6 1096
1990 1001 1 0 2  7 1199
1990 1001 1 0 2  8 1105
1990 1001 1 0 2  9 1093
1990 1001 1 0 2 10  886
1990 1001 1 0 2 11  814
1990 1001 1 0 2 12  671
1990 1001 1 0 2 13  541
1990 1001 1 1 1  5    4
1990 1001 1 1 1  6    6
1990 1001 1 1 1  7   10
1990 1001 1 1 1  8    8
1990 1001 1 1 1  9    5
1990 1001 1 1 1 10    4
1990 1001 1 1 1 11    3
1990 1001 1 1 1 12    4
1990 1001 1 1 1 13    4
1990 1001 1 1 2  5    4
1990 1001 1 1 2  6    7
1990 1001 1 1 2  7   11
1990 1001 4 0 2  9    5
1990 1001 4 0 2 10    7
1990 1001 4 0 2 11   14
1990 1001 4 0 2 12    9
1990 1001 4 0 2 13    6
1990 1001 4 1 2  7    1
1990 1003 1 0 1  5 2308
1990 1003 1 0 1  6 2822
1990 1003 1 0 1  7 3148
1990 1003 1 0 1  8 3128
1990 1003 1 0 1  9 3146
end
label values race race
label def race 1 "White", modify
label def race 2 "Black", modify
label def race 3 "American Indian/Alaska Native (1990+)", modify
label def race 4 "Asian or Pacific Islander (1990+)", modify
label values hispanic hispanic
label def hispanic 0 "Non-Hispanic", modify
label def hispanic 1 "Hispanic", modify
label values gender sex
label def sex 1 "Male", modify
label def sex 2 "Female", modify
label values age age
label def age 5 "20-24 years", modify
label def age 6 "25-29 years", modify
label def age 7 "30-34 years", modify
label def age 8 "35-39 years", modify
label def age 9 "40-44 years", modify
label def age 10 "45-49 years", modify
label def age 11 "50-54 years", modify
label def age 12 "55-59 years", modify
label def age 13 "60-64 years", modify

My master data is like the following

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int year byte qtr str5 county str6 industry_code long avg_wkly_wage
2000 1 "01000" "10"      555
2000 2 "01000" "10"      545
2000 3 "01000" "10"      546
2000 4 "01000" "10"      586
2000 1 "01000" "10"      868
2000 2 "01000" "10"      839
2000 3 "01000" "10"      952
2000 4 "01000" "10"      902
2000 1 "01000" "101"     713
2000 2 "01000" "101"     703
2000 3 "01000" "101"     914
2000 4 "01000" "101"     765
2000 1 "01000" "1013"    713
2000 2 "01000" "1013"    703
2000 3 "01000" "1013"    914
2000 4 "01000" "1013"    765
2000 1 "01000" "102"     877
2000 2 "01000" "102"     846
1990 1 "01000" "33299"   713
1990 2 "01000" "33299"   703
1990 3 "01000" "33299"   914
1990 4 "01000" "33299"   765
1990 1 "01000" "332993"  713
1990 2 "01000" "332993"  703
1990 3 "01000" "332993"  914
1990 4 "01000" "332993"  765
1990 1 "01000" "44-45"   317
1990 2 "01000" "44-45"   318
1990 3 "01000" "44-45"   334
1990 4 "01000" "44-45"   356
1990 1 "01001" "445"     525
1990 2 "01001" "445"     481
1990 3 "01001" "445"     611
1990 4 "01001" "445"     519
end

I figure that I need to create the number of population for each age group, race and gender in a wide format for each year and a county so that I can easily merge all the different age group , race , gender population with my master data merging on the variable of year and county.

Can anyone kindly tell me how I can create the wide format of my using data for variables age, race , gender , hispanic on the basis of pop value ( the number of population for those categories in those counties for that particular year).

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

29 Jul 2022, 11:55

There are several different ways to do this, depending on how you want the resulting variables to be named. Here's how I would do it:

Code:

label define race 1 "W" 2 "B" 3 "AIAN" 4 "API", modify
label define hispanic 0 "NH" 1 "H", modify
label define sex 1 "M" 2 "F", modify
foreach v of varlist race hispanic gender age {
    decode `v', gen(_`v')
    drop `v'
    rename _`v' `v'
    
}
replace age = subinstr(age, " years", "", .)
replace age = subinstr(age, "-", "_", .)

egen demographics = concat(race hispanic gender age), punct("_")
drop race hispanic gender age
rename pop pop_
reshape wide pop_, i(county year) j(demographics) string

Comment

Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#3

29 Jul 2022, 13:27

WOW! Mr. Schechter, thanks would be an understatement! My steep learning curve of stata has been less steep and smoother because of statalist community and your' relentless contribution! Much obliged and appreciate this so much!
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#4

30 Jul 2022, 16:14

Mr. Schechter, I sincerely apologize for asking for an extension of favor on this similar issue. After following your code these 7 long variables convert into 149 wide variables.

My research goal is to find 7 j variables for each county in a specific year: total white population, total Asian population, total black population, total American Indian population, total male population, total Hispanic population, total population for each age bracket from 20-64 (9 age brackets coded as from 5-13)

I sincerely apologize for not articulating what I was asking for because I wasn't even sure how I can execute it. Now that I think of, I did a pretty bad job in #1 post to ask for the exact stuff I'm aiming for.

Is there anything that I can do to only find the information of bold letters. I think if I can count the number of Hispanic people based on all the age brackets I can find the total population of Hispanic in a county for a specific year. And, keep repeating the process of all the other desired variables. But, I'm not exactly sure how I cantranslate this into code.

Would you be kind enough to help me one more time? I apologize again for my naïveté!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#5

30 Jul 2022, 18:36

Code:

use using_data, clear label define race 1 "W" 2 "B" 3 "AIAN" 4 "API", modify label define hispanic 0 "NH" 1 "H", modify label define sex 1 "M" 2 "F", modify foreach v of varlist race hispanic gender age { decode `v', gen(_`v') drop `v' rename _`v' `v' } replace age = subinstr(age, " years", "", .) replace age = subinstr(age, "-", "_", .) local demographics race hispanic gender age foreach d of local demographics { by county year `d', sort: egen pop_`d' = total(pop) frame put county year `d' pop_`d', into(`d'_pop) frame `d'_pop { duplicates drop rename pop_`d' pop_ reshape wide pop_, i(county year) j(`d') string } } use master_data, clear destring county, replace foreach d of local demographics { frlink m:1 county year, frame(`d'_pop) frget pop_*, from(`d'_pop) drop `d'_pop }

At the end of this code, you can drop the female and non-hispanic population variables, which it seems you don't need.

In the example data, you do not get a variable for the Asian population because no observations for that group appear in the example version of the using file. Presumably, such observations are present in the full data set.

There is one aspect of the data that troubles me. In the using data, the county code is a numeric variable. In the master data it is a string variable with leading zeroes. The -destring- command will convert the latter to numeric (where the leading zero becomes irrelevant). I just worry that in the full data set there may be county codes that contain non-digit characters, or where the presence or absence of a leading zero distinguishes different counties. In the first case, the -destring- command will fail and the subsequent -frlink- commands will halt execution with a type mismatch error message. In the latter case, the data from two different counties will fall under the same id code and be indistinguishable from that point--in this situation there will be no error message. So I caution you to check the county codes in the master data and verify that these problems will not arise.
1 like
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#6

30 Jul 2022, 19:51

I am so humbled to learn from experienced and gracious members of this community. Very grateful for guiding me through the right and efficient process despite my clear lack of understanding of what I was looking for in the beginning. Thank you for being so open to sharing knowledge with students like us, Mr. Schechter!
Comment

Announcement

Merging using data with repeated cross section master data by changing long to wide format of variables

Comment

Comment

Comment

Comment

Comment