Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging using data with repeated cross section master data by changing long to wide format of variables

    I need to merge master data (which is repeated cross section data - observations of couple of variables for each county at 4 different quarter for a particular year) with repeated cross section of my using data. The format for my using data is given here :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int year float county byte(race hispanic gender) float(age pop)
    1990 1001 1 0 1  5  882
    1990 1001 1 0 1  6 1059
    1990 1001 1 0 1  7 1148
    1990 1001 1 0 1  8 1059
    1990 1001 1 0 1  9 1051
    1990 1001 1 0 1 10  876
    1990 1001 1 0 1 11  792
    1990 1001 1 0 1 12  670
    1990 1001 1 0 1 13  519
    1990 1001 1 0 2  5  824
    1990 1001 1 0 2  6 1096
    1990 1001 1 0 2  7 1199
    1990 1001 1 0 2  8 1105
    1990 1001 1 0 2  9 1093
    1990 1001 1 0 2 10  886
    1990 1001 1 0 2 11  814
    1990 1001 1 0 2 12  671
    1990 1001 1 0 2 13  541
    1990 1001 1 1 1  5    4
    1990 1001 1 1 1  6    6
    1990 1001 1 1 1  7   10
    1990 1001 1 1 1  8    8
    1990 1001 1 1 1  9    5
    1990 1001 1 1 1 10    4
    1990 1001 1 1 1 11    3
    1990 1001 1 1 1 12    4
    1990 1001 1 1 1 13    4
    1990 1001 1 1 2  5    4
    1990 1001 1 1 2  6    7
    1990 1001 1 1 2  7   11
    1990 1001 4 0 2  9    5
    1990 1001 4 0 2 10    7
    1990 1001 4 0 2 11   14
    1990 1001 4 0 2 12    9
    1990 1001 4 0 2 13    6
    1990 1001 4 1 2  7    1
    1990 1003 1 0 1  5 2308
    1990 1003 1 0 1  6 2822
    1990 1003 1 0 1  7 3148
    1990 1003 1 0 1  8 3128
    1990 1003 1 0 1  9 3146
    end
    label values race race
    label def race 1 "White", modify
    label def race 2 "Black", modify
    label def race 3 "American Indian/Alaska Native (1990+)", modify
    label def race 4 "Asian or Pacific Islander (1990+)", modify
    label values hispanic hispanic
    label def hispanic 0 "Non-Hispanic", modify
    label def hispanic 1 "Hispanic", modify
    label values gender sex
    label def sex 1 "Male", modify
    label def sex 2 "Female", modify
    label values age age
    label def age 5 "20-24 years", modify
    label def age 6 "25-29 years", modify
    label def age 7 "30-34 years", modify
    label def age 8 "35-39 years", modify
    label def age 9 "40-44 years", modify
    label def age 10 "45-49 years", modify
    label def age 11 "50-54 years", modify
    label def age 12 "55-59 years", modify
    label def age 13 "60-64 years", modify

    My master data is like the following

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int year byte qtr str5 county str6 industry_code long avg_wkly_wage
    2000 1 "01000" "10"      555
    2000 2 "01000" "10"      545
    2000 3 "01000" "10"      546
    2000 4 "01000" "10"      586
    2000 1 "01000" "10"      868
    2000 2 "01000" "10"      839
    2000 3 "01000" "10"      952
    2000 4 "01000" "10"      902
    2000 1 "01000" "101"     713
    2000 2 "01000" "101"     703
    2000 3 "01000" "101"     914
    2000 4 "01000" "101"     765
    2000 1 "01000" "1013"    713
    2000 2 "01000" "1013"    703
    2000 3 "01000" "1013"    914
    2000 4 "01000" "1013"    765
    2000 1 "01000" "102"     877
    2000 2 "01000" "102"     846
    1990 1 "01000" "33299"   713
    1990 2 "01000" "33299"   703
    1990 3 "01000" "33299"   914
    1990 4 "01000" "33299"   765
    1990 1 "01000" "332993"  713
    1990 2 "01000" "332993"  703
    1990 3 "01000" "332993"  914
    1990 4 "01000" "332993"  765
    1990 1 "01000" "44-45"   317
    1990 2 "01000" "44-45"   318
    1990 3 "01000" "44-45"   334
    1990 4 "01000" "44-45"   356
    1990 1 "01001" "445"     525
    1990 2 "01001" "445"     481
    1990 3 "01001" "445"     611
    1990 4 "01001" "445"     519
    end

    I figure that I need to create the number of population for each age group, race and gender in a wide format for each year and a county so that I can easily merge all the different age group , race , gender population with my master data merging on the variable of year and county.

    Can anyone kindly tell me how I can create the wide format of my using data for variables age, race , gender , hispanic on the basis of pop value ( the number of population for those categories in those counties for that particular year).


  • #2
    There are several different ways to do this, depending on how you want the resulting variables to be named. Here's how I would do it:
    Code:
    label define race 1 "W" 2 "B" 3 "AIAN" 4 "API", modify
    label define hispanic 0 "NH" 1 "H", modify
    label define sex 1 "M" 2 "F", modify
    foreach v of varlist race hispanic gender age {
        decode `v', gen(_`v')
        drop `v'
        rename _`v' `v'
        
    }
    replace age = subinstr(age, " years", "", .)
    replace age = subinstr(age, "-", "_", .)
    
    egen demographics = concat(race hispanic gender age), punct("_")
    drop race hispanic gender age
    rename pop pop_
    reshape wide pop_, i(county year) j(demographics) string

    Comment


    • #3
      WOW! Mr. Schechter, thanks would be an understatement! My steep learning curve of stata has been less steep and smoother because of statalist community and your' relentless contribution! Much obliged and appreciate this so much!

      Comment


      • #4
        Mr. Schechter, I sincerely apologize for asking for an extension of favor on this similar issue. After following your code these 7 long variables convert into 149 wide variables.

        My research goal is to find 7 j variables for each county in a specific year: total white population, total Asian population, total black population, total American Indian population, total male population, total Hispanic population, total population for each age bracket from 20-64 (9 age brackets coded as from 5-13)

        I sincerely apologize for not articulating what I was asking for because I wasn't even sure how I can execute it. Now that I think of, I did a pretty bad job in #1 post to ask for the exact stuff I'm aiming for.

        Is there anything that I can do to only find the information of bold letters. I think if I can count the number of Hispanic people based on all the age brackets I can find the total population of Hispanic in a county for a specific year. And, keep repeating the process of all the other desired variables. But, I'm not exactly sure how I cantranslate this into code.

        Would you be kind enough to help me one more time? I apologize again for my naïveté!

        Comment


        • #5
          Code:
          use using_data, clear
          
          label define race 1 "W" 2 "B" 3 "AIAN" 4 "API", modify
          label define hispanic 0 "NH" 1 "H", modify
          label define sex 1 "M" 2 "F", modify
          foreach v of varlist race hispanic gender age {
              decode `v', gen(_`v')
              drop `v'
              rename _`v' `v'
              
          }
          replace age = subinstr(age, " years", "", .)
          replace age = subinstr(age, "-", "_", .)
          
          local demographics race hispanic gender age
          
          foreach d of local demographics {
              by county year `d', sort: egen pop_`d' = total(pop)
              frame put county year `d' pop_`d', into(`d'_pop)
              frame `d'_pop {
                  duplicates drop
                  rename pop_`d' pop_
                  reshape wide pop_, i(county year) j(`d') string
              }
          }
          
          use master_data, clear
          destring county, replace
          foreach d of local demographics {
              frlink m:1 county year, frame(`d'_pop)
              frget pop_*, from(`d'_pop)
              drop `d'_pop
          }
          At the end of this code, you can drop the female and non-hispanic population variables, which it seems you don't need.

          In the example data, you do not get a variable for the Asian population because no observations for that group appear in the example version of the using file. Presumably, such observations are present in the full data set.

          There is one aspect of the data that troubles me. In the using data, the county code is a numeric variable. In the master data it is a string variable with leading zeroes. The -destring- command will convert the latter to numeric (where the leading zero becomes irrelevant). I just worry that in the full data set there may be county codes that contain non-digit characters, or where the presence or absence of a leading zero distinguishes different counties. In the first case, the -destring- command will fail and the subsequent -frlink- commands will halt execution with a type mismatch error message. In the latter case, the data from two different counties will fall under the same id code and be indistinguishable from that point--in this situation there will be no error message. So I caution you to check the county codes in the master data and verify that these problems will not arise.

          Comment


          • #6
            I am so humbled to learn from experienced and gracious members of this community. Very grateful for guiding me through the right and efficient process despite my clear lack of understanding of what I was looking for in the beginning. Thank you for being so open to sharing knowledge with students like us, Mr. Schechter!

            Comment

            Working...
            X