
  • How to generate household head characteristics

    Dear all,

    How do I generate household head age, education level, and household size? My data doesn't have a PID, though it has hhid, uniqkey, and respondent characteristics (questionnaire).

    I am using Stata/IC 14.1.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(uniqkey hhid gender_resp age_resp relation_head edu_resp)
    23592911 11 1 18 3 4
    23456297 50 2 35 2 2
    24080977 25 2 30 2 1
    23026061 50 1 18 3 4
    24080972 12 2 49 1 1
    23996364 93 2 33 1 1
    23466688 37 1 29 4 2
    23526939 19 1 46 1 2
    22952404 22 1 45 1 5
    23716844 15 2 22 2 4
    23332897 41 2 19 2 2
    23331005  4 1 24 5 7
    22437981 94 1 23 1 5
    23963878 33 2 36 1 7
    23686337  8 2 24 2 3
    22656751 67 2 21 2 7
    24121285 65 2 70 2 4
    23100222 52 2 37 2 5
    23182298 81 2 23 2 4
    23248881 11 1 23 1 5
    23028824 28 1 35 1 5
    end
    
    label values gender_resp gender_resp
    label def gender_resp 1 "Male", modify
    label def gender_resp  2 "Female", modify
    label values relation_head relation_head
    label def relation_head 1 "Head", modify
    label def relation_head 2 "Spouse", modify
    label def relation_head 3"Son/Daughter", modify
    label def relation_head 4 "Father/Mother", modify
    label def relation_head 5 "Sister/Brother", modify
    label values edu_resp edu_resp
    label def edu_resp 1 "None", modify
    label def edu_resp 2 "Some primary ", modify
    label def edu_resp 3 "Primary completed", modify
    label def edu_resp 4 "Some secondary ", modify
    label def edu_resp 5 "Secondary completed", modify
    label def edu_resp 7 "University degree", modify
    ******Head Age
    egen age_hh = total(age_resp* (age_resp >=16)), by(pid)
    replace age_hh = age_hh - age_resp * (age_resp >= 18)

    I tried this for the age of the household head. It seems to have worked, but when I summarize age_hh I get a mean age_hh of approximately 80 years.


  • #2
    I'm not sure I understand what you want to do here. Your statement seems straightforward, but the code you have tried does not seem to me to be related to finding the characteristics of the household head.

    There is also the problem that, at least in your example data, there are numerous households that have no head at all.

    The following code begins by verifying that each household has one and only one head. Then it calculates the age and education of that household head, and the household size.

    Code:
    //    VERIFY EACH HOUSEHOLD HAS EXACTLY ONE HEAD
    by hhid, sort: egen head_count = total(relation_head == 1)
    assert head_count == 1
    
    //    CALCULATE AGE AND EDUCATION OF HOUSEHOLD HEAD
    foreach x in age edu {
        by hhid, sort: egen `x'_hh = max(cond(relation_head == 1, `x'_resp, .))
    }
    
    //    CALCULATE HOUSEHOLD SIZES
    by hhid, sort: gen hh_size = _N
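
    If the assert fails, a tabulation per household shows how many households have zero, one, or several heads. The following is a minimal sketch on invented toy data, assuming the variable names from post #1; hh_tag is a hypothetical helper name.

    ```stata
    * Toy data (invented): household 11 has one head, 12 has none, 13 has two
    clear
    input hhid relation_head
    11 1
    11 3
    12 2
    12 3
    13 1
    13 1
    end

    * Number of members coded as head within each household
    by hhid, sort: egen head_count = total(relation_head == 1)

    * Flag one observation per household so each household is counted once
    egen hh_tag = tag(hhid)
    tabulate head_count if hh_tag
    ```

    Unlike assert, tabulate does not stop the do-file, so it can be run first to see the size of the problem.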



    • #3
      Here's a different approach that has the slightly added advantage that the variable labels from the individual variables are carried over to the head variables. As Clyde pointed out, your sample data was relatively unhelpful because your data had not first been sorted by household, so it was a random mixture of individuals. I fiddled with it to produce sample data that worked for my demonstration.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double(uniqkey hhid gender_resp age_resp relation_head edu_resp)
      23248881 11 1 23 1 5
      23592911 11 1 18 3 4
      24080972 12 2 49 1 1
      23716844 12 2 22 2 4
      23526939 19 1 46 1 2
      22952404 22 1 45 1 5
      24080977 22 2 30 2 1
      23028824 28 1 35 1 5
      23963878 33 2 36 1 7
      23466688 33 1 29 4 2
      23332897 50 2 19 1 2
      23456297 50 2 35 2 2
      23026061 50 1 18 3 4
      end
      label values gender_resp gender_resp
      label def gender_resp 1 "Male", modify
      label def gender_resp  2 "Female", modify
      label values relation_head relation_head
      label def relation_head 1 "Head", modify
      label def relation_head 2 "Spouse", modify
      label def relation_head 3"Son/Daughter", modify
      label def relation_head 4 "Father/Mother", modify
      label def relation_head 5 "Sister/Brother", modify
      label values edu_resp edu_resp
      label def edu_resp 1 "None", modify
      label def edu_resp 2 "Some primary ", modify
      label def edu_resp 3 "Primary completed", modify
      label def edu_resp 4 "Some secondary ", modify
      label def edu_resp 5 "Secondary completed", modify
      label def edu_resp 7 "University degree", modify
      
      tempfile survey
      save `survey'
      
      drop if relation_head != 1
      bysort hhid: assert _N==1
      keep hhid *_resp
      rename (*_resp) (*_head)
      list, noobs
      tempfile heads
      save `heads'
      
      use `survey', clear
      merge m:1 hhid using `heads'
      drop _merge
      list hhid relation_head gender_resp age_resp gender_head age_head, noobs sepby(hhid)
      Code:
      . list, noobs
      
        +--------------------------------------------------+
        | hhid   gender~d   age_head              edu_head |
        |--------------------------------------------------|
        |   11       Male         23   Secondary completed |
        |   12     Female         49                  None |
        |   19       Male         46          Some primary |
        |   22       Male         45   Secondary completed |
        |   28       Male         35   Secondary completed |
        |--------------------------------------------------|
        |   33     Female         36     University degree |
        |   50     Female         19          Some primary |
        +--------------------------------------------------+
      Code:
      . list hhid relation_head gender_resp age_resp gender_head age_head, noobs sepby(hhid)
      
        +------------------------------------------------------------------+
        | hhid   relation_head   gender~p   age_resp   gender~d   age_head |
        |------------------------------------------------------------------|
        |   11            Head       Male         23       Male         23 |
        |   11    Son/Daughter       Male         18       Male         23 |
        |------------------------------------------------------------------|
        |   12            Head     Female         49     Female         49 |
        |   12          Spouse     Female         22     Female         49 |
        |------------------------------------------------------------------|
        |   19            Head       Male         46       Male         46 |
        |------------------------------------------------------------------|
        |   22            Head       Male         45       Male         45 |
        |   22          Spouse     Female         30       Male         45 |
        |------------------------------------------------------------------|
        |   28            Head       Male         35       Male         35 |
        |------------------------------------------------------------------|
        |   33            Head     Female         36     Female         36 |
        |   33   Father/Mother       Male         29     Female         36 |
        |------------------------------------------------------------------|
        |   50            Head     Female         19     Female         19 |
        |   50          Spouse     Female         35     Female         19 |
        |   50    Son/Daughter       Male         18     Female         19 |
        +------------------------------------------------------------------+



      • #4
        Thank you Clyde and William.

        I highly appreciate your assistance with this. However, when I try either approach, it returns an error of the form:

        Code:
        by hhid, sort: egen head_count = total(relat_h== 1)
        
        assert head_count == 1
        4,341 contradictions in 4,352 observations
        assertion is false
        
        assert head_count == 1
        4,341 contradictions in 4,352 observations
        assertion is false
        What does this mean and how do I solve it?

        I am not familiar with saving my data in a `tempfile' (I understand it is saved in a temporary folder in Windows, but I don't know how to change the path), so I opted to follow Clyde's Stata commands.


        Regards,

        Gatelik
        Last edited by Gatelik Tony; 23 Jun 2018, 23:40.



        • #5
          What this means is that most of your households (distinct values of hhid) have more than one person who is identified as head. In a household with two heads, characteristics like "age of head" are not possible to define.

          Looking at the fact that most households have more than one head, is it possible that you have panel data, where each household is interviewed more than once? If so, then there is presumably a variable - which you didn't share with us - that indicates which interview any particular observation came from. Together with hhid, that should better identify your households. If the name of your variable is, for example, survey_year, then where you see
          Code:
          hhid
          in the sample code in posts 2 and 3, you should replace it with
          Code:
          hhid survey_year
          If that is not the case, then instead of the assert command, try
          Code:
          browse if head_count != 1
          to open the Data Browser window displaying the observations for households with multiple heads and, together with the documentation for your data, try to better understand your data.
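
          As an alternative to browsing, the duplicates command can summarize how often each hhid occurs among the head records. This is a rough sketch on invented toy data, assuming the variable names from post #1; extra_heads is a hypothetical variable name.

          ```stata
          * Toy data (invented): hhid 1 has two records coded as head
          clear
          input hhid relation_head age_resp
          1 1 80
          1 1 44
          1 2 31
          2 1 40
          3 1 36
          3 2 46
          end

          * Among heads only, report how many times each hhid appears
          duplicates report hhid if relation_head == 1

          * Mark head records whose hhid is shared with another head record
          duplicates tag hhid if relation_head == 1, generate(extra_heads)
          list hhid relation_head age_resp if extra_heads > 0 & !missing(extra_heads), sepby(hhid)
          ```

          Observations excluded by the if condition get a missing value for extra_heads, which is why the list command tests for missing explicitly.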

          In the code I provided in post #3, the data is stored in a temporary file because, having read in the sample data, I needed to save it somewhere and I chose a temporary file. Your data presumably already exists in a permanent disk file somewhere and you do not need to save it again in a temporary file.
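
          For data that already lives in a permanent file, the same merge approach needs only one temporary file for the heads. The sketch below uses invented toy data in place of the real file; the file name in the first comment is hypothetical, and the code assumes hhid really does identify households.

          ```stata
          * In practice, start from the permanent file, e.g.
          *   use "survey.dta", clear        (hypothetical file name)
          clear
          input hhid relation_head age_resp edu_resp
          11 1 23 5
          11 3 18 4
          12 1 49 1
          12 2 22 4
          end

          * Build a one-row-per-household file of head characteristics
          preserve
          keep if relation_head == 1
          isid hhid                          // stops with an error if a household has two heads
          keep hhid age_resp edu_resp
          rename (*_resp) (*_head)
          tempfile heads
          save `heads'
          restore

          * Attach the head's characteristics to every household member
          merge m:1 hhid using `heads', nogenerate
          list, sepby(hhid)
          ```

          Stata chooses the location of the temporary file and erases it when the do-file finishes, so there is no path to manage. Because `heads' is a local macro, the tempfile, save, and merge lines should be run together in one do-file execution.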
          Last edited by William Lisowski; 24 Jun 2018, 05:22.



          • #6
            Thank you for your response. I performed a data check using -datacheck-, available from SSC. The results indicate that the problem arises from other household members (spouse, children, father/mother, and other relatives). I tried keeping only observations with relationship_head==1, but the assert error still occurs.

            I only have hhid and uniqkey, which is the questionnaire key, and the data are cross-sectional, not a panel.



            • #7
              The assert command tells you that there are 4,352 individuals in your dataset, of which 4,341 are in households for which more than one observation has a value of 1 for relat_h. Or, perhaps, 0 observations have a value of 1 for relat_h.
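
              The two possibilities can be separated by counting them directly. This is a minimal sketch on invented toy data, assuming the variable names from post #1.

              ```stata
              * Toy data (invented): hhid 1 has two heads, hhid 2 has none, hhid 3 has one
              clear
              input hhid relation_head
              1 1
              1 1
              1 2
              2 2
              2 3
              3 1
              end

              by hhid, sort: egen head_count = total(relation_head == 1)

              * Observations in households with no head at all
              count if head_count == 0

              * Observations in households with more than one head
              count if head_count > 1
              ```

              The two counts together make up the number of contradictions that assert reports.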

              The problem may be because you ran incorrect code.

              Based on your sample data in post #1, which shows a variable named relation_head, and no variable named relat_h, Clyde wrote the following
              Code:
              by hhid, sort: egen head_count = total(relation_head == 1)
              assert head_count == 1
              In post #4 you tell us you ran
              Code:
              by hhid, sort: egen head_count = total(relat_h== 1)
              
              assert head_count == 1
              Now, either you checked some variable relat_h that is not the variable that indicates relationship to the head, or the data you presented in post #1 is not representative of your data.

              Try replacing the assert command with
              Code:
              list if head_count!=1
              Last edited by William Lisowski; 24 Jun 2018, 08:45.



              • #8
                I apologize for the naming confusion. The data are the right ones, and relation_head is the same variable as relat_h.


                Code:
                set more off
                /    VERIFY EACH HOUSEHOLD HAS EXACTLY ONE HEAD
                
                by hhid, sort: egen head_count = total(relat_h== 1)
                list if head_count!=1
                
                //    CALCULATE AGE AND EDUCATION OF HOUSEHOLD HEAD
                foreach x in age_resp edu_resp{
                    by hhid, sort: egen `x'_hh = max(cond(relat_h == 1, `x', .))
                }
                
                sum age_resp_hh edu_resp_hh head_count
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                 age_resp_hh |      8,654    86.05639    9.311528         28        100
                 edu_resp_hh |      8,654    6.744627    .6037007          1          7
                  head_count |      8,665    43.98881    15.17873          0         71
                The mean age is quite high....

                Last edited by Gatelik Tony; 24 Jun 2018, 09:57.



                • #9
                  Post #8 raises more questions than it answers.
                  • We now know - from the variable names - that the data you present in post #1 is not the data you used in post #4 and post #8.
                  • We also know that the data in post #4 had 4352 observations, according to the assert command, but the summary command in post #8 shows 8,665 observations.
                  • That's three different sets of data, based on variable names and numbers of observations. What else might be different? I suspect you've made new datasets derived from the dataset you presented in post #1 and in the process made some significant changes that are causing the problems you see, including those reported by the assert command.
                  One thing we don't know is what the variables uniqkey and hhid in the data in post #1 actually identify. You have neglected to tell us, and we have made some assumptions that may be wrong. Thinking of datasets I've used, there was one in which the "household ID" was actually an identifier for the individual within that household. Some other variable actually contained a value that was different for each distinct household.
                  • The material presented in the [CODE] block in post #8 is apparently heavily edited rather than copied and pasted from the Results window. As with your code in post #4, the commands do not show the leading ". " prompt that is displayed in the Stata Results window. The second line is not proper syntax for a comment.
                  • Because you included "set more off" in post #8 I suspect the list command produced extensive output that you have neither shared nor examined to find the source of your values of head_count!=1.
                  • I feel confident in that assertion because the summary results in post #8 show that your values of head_count lie between 0 and 71. You ignore this and comment instead on the mean age of the household head being high!
                  • In post #5 I instructed you on how to look at your data to try to understand why head_count has values other than 1. Apparently you chose instead to do something with the SSC datacheck command. I can assure you that if your households have any number of heads other than 1, the results you have achieved for head's age will be wrong, as I told you in post #5.
                  • Your mean age is high in part because you are calculating the mean age of the heads incorrectly. Imagine a household with 3 members: a 40-year-old head, a 30-year-old spouse, and a 1-year-old child. Each of those observations will have a value of 40 for age_resp_hh and so 3 values of "40" will be entered into the mean for this one household head. To correctly compute the mean age of the household heads, you would want
                  Code:
                  sum age_resp_hh if relation_head==1
                  and you can check this by comparing it to
                  Code:
                  sum age_resp if relation_head==1
                  At this point your problem is that you do not understand your data. To understand your data and your results, you must look at your data. We cannot do that for you.
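
                  The weighting effect described above can be reproduced on a small example. The sketch below uses invented toy data: the three-member household from the bullet above plus a hypothetical one-person household whose head is 60.

                  ```stata
                  * Toy data (invented): household 1 has three members, household 2 has one
                  clear
                  input hhid relation_head age_resp
                  1 1 40
                  1 2 30
                  1 3 1
                  2 1 60
                  end

                  * Copy the head's age onto every member of the household
                  by hhid, sort: egen age_hh = max(cond(relation_head == 1, age_resp, .))

                  * All observations: household 1's head enters three times, (40+40+40+60)/4 = 45
                  summarize age_hh
                  display r(mean)

                  * Heads only: one value per household, (40+60)/2 = 50
                  summarize age_hh if relation_head == 1
                  display r(mean)
                  ```

                  The first mean is weighted by household size; the second is the mean age of household heads.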
                  Last edited by William Lisowski; 24 Jun 2018, 14:23.



                  • #10
                    Thank you, sir, for your help. I am new to this forum, so I may make a few mistakes in how I post data, despite having read FAQ #12.

                    This is my explanation regarding the data I posted

                    - #1 is a sample of the entire data. The only change is variable renaming, i.e., relation_head to relat_h, but the data are the same throughout.
                    - However, to remove any doubt, and since I need help with this, let me retain the variable names as in post #1.
                    - The data have 8,665 observations in total; what I had done in post #4 was to
                    Code:
                     keep if relation_head==1
                    Code:
                    ta relation_head
                    
                       relationship to |
                             household |      Freq.     Percent        Cum.
                    -------------------+-----------------------------------
                                  Head |      4,352       50.23       50.23
                                Spouse |      2,752       31.76       81.98
                          Son/Daughter |        820        9.46       91.45
                         Father/Mother |        372        4.29       95.74
                        Sister/Brother |        144        1.66       97.40
                            Grandchild |         94        1.08       98.49
                        Other relative |        113        1.30       99.79
                    Other non-relative |         18        0.21      100.00
                    -------------------+-----------------------------------
                                 Total |      8,665      100.00
                    - I have followed all your recommendations to the letter. For instance, when I replace the assert command with
                    Code:
                    list if head_count!=1
                    - I get the following error
                    Code:
                     clear
                    
                    . input double(uniqkey hhid gender_resp age_resp relation_head edu_resp)
                    
                            uniqkey        hhid  gender_r~p    age_resp  relation~d    edu_resp
                      1. 23100715 1 0 80 1 2
                      2. 23057547 1 0 31 2 3
                      3. 23274081 1 1 44 1 2
                      4. 23749048 1 1 55 1 4
                      5. 23243503 1 0 24 1 3
                      6. 23172349 1 1 43 1 6
                      7. 23906022 1 1 45 1 3
                      8. 22641796 1 0 30 2 1
                      9. 22913024 1 0 25 2 3
                     10. 23914304 1 0 34 1 1
                     11. 22549839 1 1 34 1 1
                     12. 23264638 1 1 46 1 4
                     13. 22782580 1 0 40 2 3
                     14. 23262017 1 1 61 1 2
                     15. 23038080 1 1 20 1 3
                     16. 22435662 1 0 37 1 5
                     17. 23376548 1 0 66 2 2
                     18. 23929260 1 0 81 1 1
                     19. 22803692 1 1 22 4 1
                     20. 23232463 1 0 35 2 2
                     21. 22436483 1 1 21 4 5
                     22. 23231202 1 0 50 2 2
                     23. 23766268 2 1 40 1 1
                     24. 23027864 2 1 23 1 6
                     25. 23102007 2 1 66 1 3
                     26. 23996659 2 0 37 2 5
                     27. 23885477 2 0 17 4 4
                     28. 23383315 2 1 37 1 5
                     29. 23681554 2 0 40 1 3
                     30. 23358174 2 0 29 1 4
                     31. 23564467 2 0 60 2 3
                     32. 23565050 2 1 25 3 5
                     33. 22657283 2 0 24 2 1
                     34. 23273920 2 0 30 2 4
                     35. 22652368 2 1 21 3 5
                     36. 23297018 2 0 33 2 3
                     37. 22930350 3 1 36 1 3
                     38. 22683962 3 0 41 1 3
                     39. 23312313 3 0 44 1 2
                     40. 23041366 3 0 16 5 4
                     41. 23798868 3 0 46 2 3
                     42. 22795777 3 0 32 2 3
                     43. 22872566 3 0 35 2 2
                     44. 23686088 3 0 29 2 5
                     45. 23062772 3 0 16 4 2
                     46. 23189013 3 0 50 1 4
                     47. 22620672 3 0 33 2 3
                     48. end
                    
                    .
                    . tempfile survey
                    
                    . save `survey'
                    file C:\Users\KDIS\AppData\Local\Temp\ST_01000001.tmp saved
                    
                    .
                    . drop if relation_head!= 1
                    (24 observations deleted)
                    
                    . list if relation_head!=1
                    
                    . keep hhid *_resp
                    
                    . rename (*_resp) (*_head)
                    
                    . list, noobs
                    
                      +---------------------------------------+
                      | hhid   gender~d   age_head   edu_head |
                      |---------------------------------------|
                      |    1          0         80          2 |
                      |    1          1         44          2 |
                      |    1          1         55          4 |
                      |    1          0         24          3 |
                      |    1          1         43          6 |
                      |---------------------------------------|
                      |    1          1         45          3 |
                      |    1          0         34          1 |
                      |    1          1         34          1 |
                      |    1          1         46          4 |
                      |    1          1         61          2 |
                      |---------------------------------------|
                      |    1          1         20          3 |
                      |    1          0         37          5 |
                      |    1          0         81          1 |
                      |    2          1         40          1 |
                      |    2          1         23          6 |
                      |---------------------------------------|
                      |    2          1         66          3 |
                      |    2          1         37          5 |
                      |    2          0         40          3 |
                      |    2          0         29          4 |
                      |    3          1         36          3 |
                      |---------------------------------------|
                      |    3          0         41          3 |
                      |    3          0         44          2 |
                      |    3          0         50          4 |
                      +---------------------------------------+
                    
                    . tempfile heads
                    
                    . save `heads'
                    file C:\Users\KDIS\AppData\Local\Temp\ST_01000002.tmp saved
                    
                    .
                    . use `survey', clear
                    
                    . merge m:1 hhid using `heads'
                    variable hhid does not uniquely identify observations in the using data
                    r(459);
                    
                    end of do-file
                    
                    r(459);
                    - I greatly appreciate your help. After examining the data closely, I noticed that I get this error when I use hhid, particularly for household members other than the head. I tried using uniqkey instead, which is the serial number of the questionnaire, since I don't have any other identifier, and all the code runs without error. However, I am also interested in deriving head characteristics for respondents who reported their relationship to the household head as spouse, child, etc.

                    Is it possible to get head characteristics for the remaining 4,313 observations?

                    Code:
                    
                    di  8665-4352
                    4313
                    -
                    Last edited by Gatelik Tony; 25 Jun 2018, 04:27.



                    • #11
                      One thing we don't know is what the variables uniqkey and hhid in the data in post #1 actually identify. You have neglected to tell us, and we have made some assumptions that may be wrong. Thinking of datasets I've used, there was one in which the "household ID" was actually an identifier for the individual within that household. Some other variable actually contained a value that was different for each distinct household.
                      You have explained uniqkey. How does the documentation for the survey describe the variable hhid?



                      • #12
                        hhid is the household identifier/ number according to the documentation.



                        • #13
                          "household identifier" are two words. What do those words mean? What does the "household identifier" identify? Your documentation - perhaps not the codebook, perhaps other documentation for the survey - should say what its purpose is and how it is to be interpreted.

                          Apparently the uniqkey identifies each distinct questionnaire. That tells us nothing about how questionnaires correspond to households. Is there one questionnaire per household? Or one per individual?

                          This is the key to all the problems you have had. The "household identifier" does not identify households, as its name suggests it should. And we don't know if the uniqkey identifies households. You say your code is working perfectly using uniqkey instead of hhid, but what you write suggests it was run only on data for heads of households. The code Clyde provided in post #2 was designed to be run on the entire dataset of heads, spouses, children, etc.

                          If every individual in a household has the same uniqkey, then uniqkey, not hhid, functions as a "household identifier" that identifies members of the same household.

                          Your 48 sample observations do not show two individuals with the same uniqkey. Perhaps if you were to
                          Code:
                          sort uniqkey hhid
                          list uniqkey hhid relation_head in 1/100
                          it would be clear if uniqkey identifies households.
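
                          A count of observations per uniqkey gives the same information without reading a listing. This is a minimal sketch using four rows taken from the sample in post #10; n_per_key is a hypothetical variable name.

                          ```stata
                          * Four observations copied from the sample in post #10
                          clear
                          input double(uniqkey hhid relation_head)
                          23100715 1 1
                          23057547 1 2
                          23274081 1 1
                          23766268 2 1
                          end

                          * How many observations share each uniqkey?
                          bysort uniqkey: generate n_per_key = _N
                          tabulate n_per_key

                          * isid exits with an error when uniqkey repeats; capture keeps the do-file running
                          capture isid uniqkey
                          display _rc
                          ```

                          If every value of n_per_key is 1 (so _rc is 0), uniqkey distinguishes individual records and cannot by itself group members of a household.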

                          If uniqkey does not identify households, you need to find some way of identifying which individuals are members of the same household. There is no avoiding this requirement. You have to know which non-heads are in the same household as the heads. Turn to the survey documentation for advice. My experience is that too many survey users do not take the time to thoroughly read the documentation and examine the data.

                          If uniqkey does indeed identify members of the same household, then if you apply Clyde's advice from post #2 to your 8665 observations, replacing "hhid" with "uniqkey", you should get what you want. The variables giving the household head's characteristics will exist on each observation having the same uniqkey.
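
                          That substitution can be sketched by keeping the grouping variable in a local macro. The toy data below is invented, and it assumes, without any confirmation from the thread, that household members share a uniqkey; the local macro name hh is hypothetical.

                          ```stata
                          * Toy data (invented uniqkey values, assumed to be shared within a household)
                          clear
                          input double(uniqkey hhid relation_head age_resp edu_resp)
                          5001 1 1 40 2
                          5001 1 2 30 3
                          5002 1 1 55 4
                          5002 1 3 20 5
                          end

                          * The variable believed to identify households
                          local hh uniqkey

                          * Verify exactly one head per household before deriving anything
                          by `hh', sort: egen head_count = total(relation_head == 1)
                          assert head_count == 1

                          * Head's age and education, copied to every household member
                          foreach x in age edu {
                              by `hh', sort: egen `x'_hh = max(cond(relation_head == 1, `x'_resp, .))
                          }

                          * Household size
                          by `hh', sort: generate hh_size = _N
                          list, sepby(`hh')
                          ```

                          Changing the one local definition switches the whole block to a different household identifier if the documentation points to another variable.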

