  • Creating household level variable with data in wide format

    Hi!

    I'm working with a household data set, to which I imported several individual-level variables (e.g. sex, age, ethnicity) from a corresponding data set. I reshaped the data from long to wide so that within each household ID the individual variables became sex1, sex2, sex3 etc so that I could run analyses on a household level.

    I'm currently trying to create a household level ethnicity variable to essentially say if all household members = 2 then the household is a 2. This is my code thus far: (q1018 is the variable for ethnicity)

    gen hhold_ethn=0
    replace hhold_ethn =. if q10181==.| q10182==.| q10183==.| q10184==.| q10185==.
    replace hhold_ethn =1 if q10181==1| q10182==1| q10183==1| q10184==1| q10185==1
    replace hhold_ethn =2 if q10181==2| q10182==2| q10183==2| q10184==2| q10185==2
    replace hhold_ethn =3 if q10181==3| q10182==3| q10183==3| q10184==3| q10185==3
    replace hhold_ethn =4 if q10181==4| q10182==4| q10183==4| q10184==4| q10185==4
    replace hhold_ethn =8 if q10181==8| q10182==8| q10183==8| q10184==8| q10185==8

    replace hhold_ethn =87 if q10181==87| q10182==87| q10183==87| q10184==87| q10185==87

    In most households everyone has the same ethnicity. The problem is that I want to code households where, for example, q10181=1 and q10183=3 as "mixed". Unfortunately, the code above doesn't take that into account and just assigns the mixed households one number or the other.

    Any help on figuring this out (I've no doubt there is a very simple solution) would be greatly appreciated!

    Cheers,

    Maddie

  • #2
    You might have been better off leaving it long. This is a great paper to read for working with Panel Data:

    http://www.stata-journal.com/sjpdf.h...iclenum=dm0033

    But luckily, I don't think this problem is too hard. See the help for egen, and for the diff function in particular. It says "diff(varlist) creates an indicator variable equal to 1 if the variables in varlist are not equal and 0 otherwise."

    So, I think you want something like

    Code:
    egen mixed = diff(q10181 q10182 q10183 q10184 q10185)
    replace hhold_ethn = 7 if mixed == 1
    Incidentally, you are using ors when I think you want ands. So, for example, after the =1 command, the household gets coded as 1 if any member of the household is a 1. But, if a household has all 1s except for a single 3, it is going to get coded as a 3. In a mixed family, whoever has the ethnicity that you code for last wins, e.g. a 4 is going to beat out any members who are 1s, 2s or 3s.
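    To make the "last one wins" point concrete, here is a minimal sketch using made-up data (one invented household with four 1s and a single 3) and only two of the replace commands from the original post:

    ```stata
    * Made-up data: one household with four 1s and a single 3
    clear
    input q10181 q10182 q10183 q10184 q10185
    1 1 3 1 1
    end

    gen hhold_ethn = 0
    replace hhold_ethn = 1 if q10181==1 | q10182==1 | q10183==1 | q10184==1 | q10185==1
    replace hhold_ethn = 3 if q10181==3 | q10182==3 | q10183==3 | q10184==3 | q10185==3

    * The later replace overwrites the earlier one, so the single 3 decides the code
    list
    ```

    With | the second replace fires as soon as any one member is a 3, regardless of what the other members are.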

    But I imagine going from long to wide created a lot of missing values, unless every household has at least 5 members. That is going to create complications too, e.g. if one member were missing I think the household would wind up with the initial value of 0. Missing data might also cause my diff solution not to do what you want, e.g. if you had four 1s and missing data for a 5th person who doesn't exist, I am not sure what happens.

    Anyway, try what I said, check it carefully, and if it doesn't work, write back. I am worried that missing data created by creating variables for nonexistent household members may cause grief, but there is no point in thinking about it too much until we know if you actually have missing data.
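    One way to check what diff() does with missing values is to try it on a tiny made-up dataset. The sketch below assumes two invented households, one with five members and one with only two, so that the second has missing values for the nonexistent members:

    ```stata
    * Made-up data: household 1 has five members, household 2 has only two
    clear
    input hh q10181 q10182 q10183 q10184 q10185
    1 1 1 1 1 1
    2 1 1 . . .
    end

    egen mixed = diff(q10181 q10182 q10183 q10184 q10185)

    * Compare the flag for the complete household with the incomplete one
    list hh q10181-q10185 mixed
    ```

    If diff() compares missing like any other value, household 2 should come out flagged as 1 even though both of its actual members are the same.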
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      I would generate the variable before moving to wide format, but how about something like:

      Code:
      clear all
      set more off
      
      *----- example data -----
      
      input ///
      hh eth1 eth2 eth3
      1 2 2 2
      2 1 2 1
      4 . . .
      5 3 . 1
      3 3 3 .
      end
      
      label define ethlbl 1 "white" 2 "black" 3 "pink" 999 "mixed"
      label values eth* ethlbl
      
      list
      
      *----- what you want -----
      
      egen ethmin = rowmin(eth1-eth3)
      egen ethmax = rowmax(eth1-eth3)
      
      gen eth = 999
      replace eth = ethmin if ethmin == ethmax
      
      label values eth ethlbl
      
      list
      ?

      You should be careful with the implications of your missings.
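      As for generating the variable before moving to wide format, a rough sketch of the same min/max idea in long format might look like the following. The data and the names person and hh_eth are invented for illustration:

      ```stata
      * Made-up long data: one row per person, hh = household id, eth = ethnicity code
      clear
      input hh person eth
      1 1 2
      1 2 2
      1 3 2
      2 1 1
      2 2 2
      2 3 1
      3 1 3
      3 2 3
      end

      * Household minimum and maximum; missing values are skipped
      bysort hh: egen ethmin = min(eth)
      bysort hh: egen ethmax = max(eth)

      gen hh_eth = 999
      replace hh_eth = ethmin if ethmin == ethmax

      * Keep one row per household if a household-level file is needed
      bysort hh: keep if _n == 1
      list hh hh_eth
      ```

      In long format there are no placeholder variables for household members who do not exist, so small households do not add extra missing values.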
      You should:

      1. Read the FAQ carefully.

      2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

      3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

      4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.



      • #4
        Hi Richard,

        Thanks so much for your prompt response and helpful advice.
        I do indeed have a lot of missing data, as many of the households have only one or two members, and your hunch was right: switching to & caused only one "real change" to be made (the one household with 5 people!).
        The egen code also ran into problems because of the missing data: even households with only one respondent were coded as 1.

        Maybe I'll look into other options for using this variable.....!

        Cheers,

        Maddie
        Last edited by Maddie R; 25 Jul 2014, 01:05.



        • #5
          Hi Roberto,

          Thanks for your helpful advice too! I'll give that a go too and see how that turns out.

          Cheers,

          Maddie



          • #6
            Since you have missing data Roberto's code is definitely the way to go.

            I fear this wide format could keep on causing you grief. You'll always have to be careful that missing data is being handled correctly. Consider using the egen commands as they will often handle missing the way you want it handled.

            For example, suppose you wanted to count the # of minority members in the household. Add these lines to the bottom of Roberto's code.

            Code:
            recode eth1 eth2 eth3 (1 = 0)(2/3 = 1) (else = .),  gen(min1 min2 min3)
            gen nminor1 = min1 + min2 + min3
            egen nminor2 = rowtotal(min1 min2 min3), missing
            list
            Your first impulse might be to use the gen command above, but that would be wrong, since missing on any of the three variables will cause the nminor1 variable to also be missing. You should use the egen command instead.

            Code:
            . list
            
                 +-----------------------------------------------------------------------------------------------+
                 | hh    eth1    eth2    eth3   ethmin   ethmax     eth   min1   min2   min3   nminor1   nminor2 |
                 |-----------------------------------------------------------------------------------------------|
              1. |  1   black   black   black        2        2   black      1      1      1         3         3 |
              2. |  2   white   black   white        1        2   mixed      0      1      0         1         1 |
              3. |  4       .       .       .        .        .       .      .      .      .         .         . |
              4. |  5    pink       .   white        1        3   mixed      1      .      0         .         1 |
              5. |  3    pink    pink       .        3        3    pink      1      1      .         .         2 |
                 +-----------------------------------------------------------------------------------------------+


            • #7
              Hi again,

              I'm so close with that code! I've run it and looked at the variables and found a curious thing where for some households, the code has worked and if there is one person who is listed as 2 (white) then the eth variable has coded the household as 2. In other households however, the rowmax function is giving me a huge number e.g. 64, 56, 70 etc which of course codes that household as "mixed" when that is not the case.

              I thought it might be because "Don't know" =8 and "Other" = 87 but in most of the households where it has incorrectly coded as "mixed" there is only one household member or there are a few members but all with the same ethnicity.

              Any thoughts on what might be causing this?

              Thanks again for all your help,

              Maddie



              • #8
                I don't know how you are getting 64, 56, etc. You may wish to show us the code you are using now. Could it be that the data set erroneously has codes like 64 and 56?

                As for those 8s and 87s, you may wish to recode them to distinct missing-data codes, e.g. 8 = .a, 87 = .b. However, is "Other" really a missing code? If you have 4 whites and one other, why not code the household as mixed?

                Anyway, why don't you show the current code. That will be easier than us guessing. You might also list a few cases where the coding seems wrong to you.
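                If you do decide that 8 and 87 should count as missing, a minimal sketch of that recode, using made-up data and Roberto's variable names, would be something like:

                ```stata
                * Made-up data: 8 = "Don't know", 87 = "Other"
                clear
                input hh eth1 eth2 eth3
                1 2 2 8
                2 2 87 2
                3 1 3 .
                end

                * Turn 8 and 87 into extended missing values so they stay distinguishable
                recode eth1 eth2 eth3 (8 = .a) (87 = .b)

                * rowmin and rowmax ignore missing values, including .a and .b
                egen ethmin = rowmin(eth1 eth2 eth3)
                egen ethmax = rowmax(eth1 eth2 eth3)

                gen eth = 999
                replace eth = ethmin if ethmin == ethmax
                list
                ```

                Under that assumption, households 1 and 2 are no longer pushed into "mixed" by a single 8 or 87. If "Other" is a real category for you, leave 87 alone and recode only the 8s.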


                • #9
                  That's what I thought might be the case, but for my ethnicity variable:
                  1= African/black
                  2 = white
                  3= coloured
                  4= Asian/Indian
                  8 = Don't know
                  87 = Other
                  999 = Mixed

                  I've had a look at the data and there don't appear to be any strange values. This is just a bit of output to illustrate how the ethnicity variable looks in my data.
                       +--------------------------------------------------------------+
                       |     hhid            eth1       eth2       eth3   eth4   eth5 |
                       |--------------------------------------------------------------|
                    1. | 15580090        Coloured          .          .      .      . |
                    2. | 15580184        Coloured          .          .      .      . |
                    3. | 15580185        Coloured   Coloured          .      .      . |
                    4. | 15580186        Coloured   Coloured          .      .      . |
                    5. | 15580187        Coloured          .          .      .      . |
                       |--------------------------------------------------------------|
                    6. | 15580189        Coloured          .          .      .      . |
                    7. | 15580191        Coloured   Coloured   Coloured      .      . |
                    8. | 15580193               .          .          .      .      . |
                    9. | 15580196        Coloured   Coloured          .      .      . |
                   10. | 15580197               .   Coloured          .      .      . |
                       |--------------------------------------------------------------|
                   11. | 15580200        Coloured          .          .      .      . |
                   12. | 15580203        Coloured          .          .      .      . |
                   13. | 15580205        Coloured          .          .      .      . |
                   14. | 15580248        Coloured          .          .      .      . |
                   15. | 15580253        Coloured          .          .      .      . |
                       |--------------------------------------------------------------|
                   16. | 15580267        Coloured   Coloured   Coloured      .      . |
                   17. | 15580272               .          .          .      .      . |
                   18. | 15580273               .          .          .      .      . |
                   19. | 15580367   African/black          .          .      .      . |
                   20. | 15580370        Coloured          .          .      .      . |
                       |--------------------------------------------------------------|
                   21. | 15580375        Coloured          .          .      .      . |
                   22. | 15580392        Coloured          .          .      .      . |
                   23. | 15580399        Coloured          .          .      .      . |
                   24. | 15580408   African/black          .          .      .      . |
                   25. | 15580413               .          .          .      .      . |
                       |--------------------------------------------------------------|
                   26. | 15580417        Coloured          .          .      .      . |
                   27. | 15580482        Coloured          .          .      .      . |
                   28. | 15580485        Coloured   Coloured          .      .      . |
                   29. | 15580491        Coloured   Coloured   Coloured      .      . |
                   30. | 15580498        Coloured          .          .      .      . |
                       |--------------------------------------------------------------|
                   31. | 15580501        Coloured   Coloured          .      .      . |



                  Then I've so far just used the section of Roberto's code

                  egen ethmin=rowmin (eth1-eth5)
                  egen ethmax= rowmax (eth1-eth5)

                  gen eth=999
                  replace eth=ethmin if ethmin==ethmax

                  Hope this info is helpful. Let me know if I can explain it any better (apologies if I need to, I'm slowly getting to grips with STATA!). It's just really strange that the code seems to work for some households but not others: for example, observation 1 is correctly coded as a 3 in the eth variable, but observation 2 is coded as mixed due to an inflated number.

                  Cheers,
                  Maddie



                  • #10
                    It would help to see the results AFTER adding ethmin, ethmax, and eth. Give a command like

                    Code:
                    list hhid eth1 eth2 eth3 eth4 eth5 ethmin ethmax eth in 1/30
                    AND, click on the A that appears on the top right side of your post (right next to a smiley face) and choose #. This stands for code, and will allow you to enter code and output that is much easier to read.


                    • #11
                      Also, are eth1 through eth5 actually consecutive in your data set? If not, any other variables in between them would get included in the rowmin & rowmax calculations. You might try changing your egen commands to

                      Code:
                      egen ethmin=rowmin (eth1 eth2 eth3 eth4 eth5)
                      egen ethmax= rowmax (eth1 eth2 eth3 eth4 eth5)
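                      If you want to see exactly which variables a range like eth1-eth5 picks up, one option is to let unab expand it. This is only a sketch on a made-up dataset in which age variables sit between the eth variables; the macro names picked and byname are arbitrary:

                      ```stata
                      * Made-up wide data where age variables sit between the eth variables
                      clear
                      input hh eth1 age1 eth2 age2 eth3 age3
                      1 3 64 . . . .
                      end

                      * unab expands a varlist into the full list of names it refers to
                      unab picked : eth1-eth3
                      display "`picked'"

                      * A wildcard matches by name rather than by position in the dataset
                      unab byname : eth?
                      display "`byname'"
                      ```

                      Here the range pulls in age1 and age2 as well, while eth? picks up only the three eth variables. Note that ? stands for exactly one character, so it would not match something like eth10.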


                      • #12
                        YES! I think that was it!! Once I redid the code as you suggested, it finally worked! It must have been pulling those large numbers from the age variable.

                        Thank you so much for your help Richard, really appreciate it. As we say in Australia, you're a "bloody legend"!

                        Cheers,

                        Maddie



                        • #13
                          Roberto is a bloody legend too (assuming that is a good thing!) He came up with the original code. When you said you were getting values like 56 and 64, it occurred to me that some other variable might be in the mix.

                          Remember as things now stand, 8 and 87 are being treated just like any other race. That may be what you want, but if not you should recode them as I suggested earlier.


                          • #14
                            Yes, you're both bloody legends! (Most definitely a good thing).
                            It's really reassuring to know that there is such a strong online support community for STATA



                            • #15
                              Glad this worked out for you.

                              So Richard just showed you how a varlist works. According to help varlist:

                              The - character indicates that all variables in the dataset, starting with the variable to the left of the -
                              and ending with the variable to the right of the - are to be returned.
                              An example:

                              Code:
                              clear all
                              set more off
                              
                              *----- example dataset -----
                              
                              input ///
                              var1 var2 varweird var3
                              1 2 999 3
                              end
                              
                              *----- list as is -----
                              
                              list var1-var3
                              
                              *----- put weird variable last and list -----
                              
                              order varweird, last
                              
                              list var1-var3
                              Originally posted by Maddie R
                              Yes, you're both bloody legends! (Most definitely a good thing).
                              Nice to hear such an effusive "thanks".

                              Lastly, two gentle reminders:

                              1. Members have a strong preference for use of full real names in the forum. You're almost there. You can hit the "contact us" button and ask the site administrators to change it.
                              2. The correct spelling is Stata, not STATA.

                              This and other advice can be found in the FAQ.