How to generate household head characteristics

Gatelik Tony

Join Date: Jun 2018
Posts: 8
#1

23 Jun 2018, 09:14

Dear all,

How do I generate the household head's age and education level, and the household size? My data do not have a person ID (PID), though they do have hhid, uniqkey, and respondent characteristics (from the questionnaire).

I am using Stata/IC 14.1.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(uniqkey hhid gender_resp age_resp relation_head edu_resp)
23592911 11 1 18 3 4
23456297 50 2 35 2 2
24080977 25 2 30 2 1
23026061 50 1 18 3 4
24080972 12 2 49 1 1
23996364 93 2 33 1 1
23466688 37 1 29 4 2
23526939 19 1 46 1 2
22952404 22 1 45 1 5
23716844 15 2 22 2 4
23332897 41 2 19 2 2
23331005  4 1 24 5 7
22437981 94 1 23 1 5
23963878 33 2 36 1 7
23686337  8 2 24 2 3
22656751 67 2 21 2 7
24121285 65 2 70 2 4
23100222 52 2 37 2 5
23182298 81 2 23 2 4
23248881 11 1 23 1 5
23028824 28 1 35 1 5
end

label values gender_resp gender_resp
label def gender_resp 1 "Male", modify
label def gender_resp  2 "Female", modify
label values relation_head relation_head
label def relation_head 1 "Head", modify
label def relation_head 2 "Spouse", modify
label def relation_head 3"Son/Daughter", modify
label def relation_head 4 "Father/Mother", modify
label def relation_head 5 "Sister/Brother", modify
label values edu_resp edu_resp
label def edu_resp 1 "None", modify
label def edu_resp 2 "Some primary ", modify
label def edu_resp 3 "Primary completed", modify
label def edu_resp 4 "Some secondary ", modify
label def edu_resp 5 "Secondary completed", modify
label def edu_resp 7 "University degree", modify

******Head Age
egen age_hh = total(age_resp* (age_resp >=16)), by(pid)
replace age_hh = age_hh - age_resp * (age_resp >= 18)

I tried this for the age of the household head, and it seems to have worked, although when I summarize age_hh I get a mean of approximately 80 years.


Clyde Schechter

Join Date: Apr 2014

Posts: 30104
#2

23 Jun 2018, 11:39

I'm not sure I understand what you want to do here. Your statement seems straightforward, but the code you have tried does not seem to me to be related to finding the characteristics of the household head.

There is also the problem that, at least in your example data, there are numerous households that have no head at all.

The following code begins by verifying that each household has one and only one head. Then it calculates the age and education of that household head, and the household size.

Code:

//    VERIFY EACH HOUSEHOLD HAS EXACTLY ONE HEAD
by hhid, sort: egen head_count = total(relation_head == 1)
assert head_count == 1

//    CALCULATE AGE AND EDUCATION OF HOUSEHOLD HEAD
//    (full variable names are used so the loop does not depend on abbreviation)
foreach x in age edu {
    by hhid, sort: egen `x'_hh = max(cond(relation_head == 1, `x'_resp, .))
}

//    CALCULATE HOUSEHOLD SIZES
by hhid, sort: gen hh_size = _N
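
Where the assert in the code above fails, a non-fatal variant of the same check can show how many households have zero, one, or several heads. The following is only a sketch on made-up data; the toy observations and the variable name one_per_hh are assumptions for illustration and are not part of the original data.

```stata
* Toy data (hypothetical): household 1 has one head, 2 has none, 3 has two
clear
input hhid relation_head age_resp
1 1 40
1 2 35
2 2 30
2 3 10
3 1 50
3 1 45
end

* Number of heads recorded in each household
bysort hhid: egen head_count = total(relation_head == 1)

* Tag one observation per household so each household is counted once
egen one_per_hh = tag(hhid)

* Distribution of the number of heads per household
tabulate head_count if one_per_hh

* Households that need a closer look
list hhid relation_head age_resp if head_count != 1, sepby(hhid)
```

Only when the tabulation shows a single row, head_count equal to 1, are the head characteristics well defined for every household.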

William Lisowski

Join Date: Dec 2014
Posts: 10150
#3

23 Jun 2018, 11:53

Here's a different approach that has the slightly added advantage that the variable labels from the individual variables are carried over to the head variables. As Clyde pointed out, your sample data was relatively unhelpful because your data had not first been sorted by household, so it was a random mixture of individuals. I fiddled with it to produce sample data that worked for my demonstration.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(uniqkey hhid gender_resp age_resp relation_head edu_resp)
23248881 11 1 23 1 5
23592911 11 1 18 3 4
24080972 12 2 49 1 1
23716844 12 2 22 2 4
23526939 19 1 46 1 2
22952404 22 1 45 1 5
24080977 22 2 30 2 1
23028824 28 1 35 1 5
23963878 33 2 36 1 7
23466688 33 1 29 4 2
23332897 50 2 19 1 2
23456297 50 2 35 2 2
23026061 50 1 18 3 4
end
label values gender_resp gender_resp
label def gender_resp 1 "Male", modify
label def gender_resp  2 "Female", modify
label values relation_head relation_head
label def relation_head 1 "Head", modify
label def relation_head 2 "Spouse", modify
label def relation_head 3"Son/Daughter", modify
label def relation_head 4 "Father/Mother", modify
label def relation_head 5 "Sister/Brother", modify
label values edu_resp edu_resp
label def edu_resp 1 "None", modify
label def edu_resp 2 "Some primary ", modify
label def edu_resp 3 "Primary completed", modify
label def edu_resp 4 "Some secondary ", modify
label def edu_resp 5 "Secondary completed", modify
label def edu_resp 7 "University degree", modify

tempfile survey
save `survey'

drop if relation_head != 1
bysort hhid: assert _N==1
keep hhid *_resp
rename (*_resp) (*_head)
list, noobs
tempfile heads
save `heads'

use `survey', clear
merge m:1 hhid using `heads'
drop _merge
list hhid relation_head gender_resp age_resp gender_head age_head, noobs sepby(hhid)

Code:

. list, noobs

  +--------------------------------------------------+
  | hhid   gender~d   age_head              edu_head |
  |--------------------------------------------------|
  |   11       Male         23   Secondary completed |
  |   12     Female         49                  None |
  |   19       Male         46          Some primary |
  |   22       Male         45   Secondary completed |
  |   28       Male         35   Secondary completed |
  |--------------------------------------------------|
  |   33     Female         36     University degree |
  |   50     Female         19          Some primary |
  +--------------------------------------------------+

Code:

. list hhid relation_head gender_resp age_resp gender_head age_head, noobs sepby(hhid)

  +------------------------------------------------------------------+
  | hhid   relation_head   gender~p   age_resp   gender~d   age_head |
  |------------------------------------------------------------------|
  |   11            Head       Male         23       Male         23 |
  |   11    Son/Daughter       Male         18       Male         23 |
  |------------------------------------------------------------------|
  |   12            Head     Female         49     Female         49 |
  |   12          Spouse     Female         22     Female         49 |
  |------------------------------------------------------------------|
  |   19            Head       Male         46       Male         46 |
  |------------------------------------------------------------------|
  |   22            Head       Male         45       Male         45 |
  |   22          Spouse     Female         30       Male         45 |
  |------------------------------------------------------------------|
  |   28            Head       Male         35       Male         35 |
  |------------------------------------------------------------------|
  |   33            Head     Female         36     Female         36 |
  |   33   Father/Mother       Male         29     Female         36 |
  |------------------------------------------------------------------|
  |   50            Head     Female         19     Female         19 |
  |   50          Spouse     Female         35     Female         19 |
  |   50    Son/Daughter       Male         18     Female         19 |
  +------------------------------------------------------------------+
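
In data where some households have no recorded head (as in the sample in post #1), the merge above leaves those households unmatched. Below is a minimal sketch, on made-up data, of the same approach that uses _merge to flag such households before dropping it. The toy observations and the variable no_head are assumptions for illustration.

```stata
* Toy data (hypothetical): household 2 has no head
clear
input hhid relation_head age_resp
1 1 40
1 2 35
2 2 30
end
tempfile survey
save `survey'

* Build a one-row-per-household file of head characteristics
keep if relation_head == 1
isid hhid                      // stops with an error if any household has two heads
keep hhid age_resp
rename age_resp age_head
tempfile heads
save `heads'

* Attach the head's characteristics to every household member
use `survey', clear
merge m:1 hhid using `heads'
tabulate _merge

* Flag households that found no head in the heads file
generate byte no_head = (_merge == 1)
drop _merge
list, sepby(hhid)
```

Because the tempfile names are local macros, this needs to be run as a single do-file or in one selection from the Do-file Editor.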


Gatelik Tony

Join Date: Jun 2018

Posts: 8
#4

23 Jun 2018, 23:02

Thank you Clyde and William.

I highly appreciate your assistance with this. However, when I use either approach, it returns an error of the form:

Code:

by hhid, sort: egen head_count = total(relat_h== 1)
assert head_count == 1
4,341 contradictions in 4,352 observations
assertion is false
assert head_count == 1
4,341 contradictions in 4,352 observations
assertion is false

What does this mean and how do I solve it?

I am not good at saving my data in a `tempfile' (I understand it is saved in a temporary folder on my Windows machine, but I don't know how to change the path), so I opted to follow Clyde's Stata commands.

Regards,

Gatelik

Last edited by Gatelik Tony; 23 Jun 2018, 23:40.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

24 Jun 2018, 05:07

What this means is that most of your households (distinct values of hhid) have more than one person who is identified as head. In a household with two heads, characteristics like "age of head" are not possible to define.

Looking at the fact that most households have more than one head, is it possible that you have panel data, where each household is interviewed more than once? If so, then there is presumably a variable - which you didn't share with us - that indicates which interview any particular observation came from. Together with hhid, that should better identify your households. If the name of your variable is, for example, survey_year, then where you see

Code:

hhid

in the sample code in posts 2 and 3, you should replace it with

Code:

hhid survey_year
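
As a sketch of what that substitution looks like in the code from post #2, here is a small made-up panel. The variable survey_year is the hypothetical name used above, and the toy observations are assumptions for illustration only.

```stata
* Toy panel (hypothetical): one household interviewed in two years
clear
input hhid survey_year relation_head age_resp
1 2016 1 40
1 2016 2 35
1 2018 1 42
1 2018 2 37
end

* One head per household per interview, not per household overall
bysort hhid survey_year: egen head_count = total(relation_head == 1)
assert head_count == 1

* Head's age within each household-interview
bysort hhid survey_year: egen age_hh = max(cond(relation_head == 1, age_resp, .))
list, sepby(hhid survey_year)
```

Grouping by hhid alone would count two heads for this household, which is the kind of failure the assert reports.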

If that is not the case, then instead of the assert command, try

Code:

browse if head_count != 1

to open the Data Browser window displaying the observations for households with multiple heads and, together with the documentation for your data, try to better understand your data.

In the code I provided in post #3, the data is stored in a temporary file because, having read in the sample data, I needed to save it somewhere and I chose a temporary file. Your data presumably already exists in a permanent disk file somewhere and you do not need to save it again in a temporary file.

Last edited by William Lisowski; 24 Jun 2018, 05:22.
Gatelik Tony

Join Date: Jun 2018

Posts: 8
#6

24 Jun 2018, 08:19

Thank you for your response. I ran a data check using -datacheck-, available from SSC. The result indicates that the problem comes from the other household members (spouse, children, father/mother and other relatives). I tried keeping only the observations with relationship_head==1, but the assert error still occurs.

I only have hhid and uniqkey, which is the questionnaire key, and the data are cross-sectional, not a panel.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

24 Jun 2018, 08:39

The assert command tells you that there are 4,352 observations in your dataset, of which 4,341 are in households for which more than one observation has a value of 1 for relat_h. Or, perhaps, 0 observations have a value of 1 for relat_h.

The problem may be because you ran incorrect code.

Based on your sample data in post #1, which shows a variable named relation_head, and no variable named relat_h, Clyde wrote the following

Code:

by hhid, sort: egen head_count = total(relation_head == 1)
assert head_count == 1

In post #4 you tell us you ran

Code:

by hhid, sort: egen head_count = total(relat_h== 1)
assert head_count == 1

Now, either you checked some variable relat_h that is not the variable that indicates relationship to the head, or the data you presented in post #1 is not representative of your data.

Try replacing the assert command with

Code:

list if head_count!=1

Last edited by William Lisowski; 24 Jun 2018, 08:45.

Gatelik Tony

Join Date: Jun 2018
Posts: 8
#8

24 Jun 2018, 09:48

I apologize for the confusion over naming. The data are the right ones, and relat_h is the same variable as relation_head.

Code:

set more off
/    VERIFY EACH HOUSEHOLD HAS EXACTLY ONE HEAD

by hhid, sort: egen head_count = total(relat_h== 1)
list if head_count!=1

//    CALCULATE AGE AND EDUCATION OF HOUSEHOLD HEAD
foreach x in age_resp edu_resp{
    by hhid, sort: egen `x'_hh = max(cond(relat_h == 1, `x', .))
}

sum age_resp_hh edu_resp_hh head_count

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
 age_resp_hh |      8,654    86.05639    9.311528         28        100
 edu_resp_hh |      8,654    6.744627    .6037007          1          7
  head_count |      8,665    43.98881    15.17873          0         71

The mean age is quite high....

Last edited by Gatelik Tony; 24 Jun 2018, 09:57.


William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

24 Jun 2018, 14:20

Post #8 raises more questions than it answers.
We now know - from the variable names - that the data you present in post #1 is not the data you used in post #4 and post #8.

We also know that the data in post #4 had 4352 observations, according to the assert command, but the summary command in post #8 shows 8,665 observations.

That's three different sets of data, based on variable names and numbers of observations. What else might be different? I suspect you've made new datasets derived from the dataset you presented in post #1 and in the process made some significant changes that are causing the problems you see, including those reported by the assert command.

One thing we don't know is what the variables uniqkey and hhid in the data in post #1 actually identify. You have neglected to tell us, and we have made some assumptions that may be wrong. Thinking of datasets I've used, there was one in which the "household ID" was actually an identifier for the individual within that household. Some other variable actually contained a value that was different for each distinct household.

The material presented in the [CODE] block in post #8 is apparently heavily edited rather than copied and pasted from the Results window. As with your code in post #4, the commands do not show the leading ". " prompt that is displayed in the Stata Results window. The second line is not proper syntax for a comment.

Because you included "set more off" in post #8 I suspect the list command produced extensive output that you have neither shared nor examined to find the source of your values of head_count!=1.

I feel confident in that assertion because the summary results in post #8 show that your values of head_count lie between 0 and 71. You ignore this and comment instead on the mean age of the household head being high!

In post #5 I instructed you on how to look at your data to try to understand why head_count has values other than 1. Apparently you chose instead to do something with the SSC datacheck command. I can assure you that if your households have any number of heads other than 1, the results you have achieved for head's age will be wrong, as I told you in post #5.

Your mean age is high in part because you are calculating the mean age of the heads incorrectly. Imagine a household with 3 members: a 40-year-old head, a 30-year-old spouse, and a 1-year-old child. Each of those observations will have a value of 40 for age_resp_hh, and so 3 values of "40" will be entered into the mean for this one household head. To correctly compute the mean age of the household heads, you would want

Code:

sum age_resp_hh if relation_head==1

and you can check this by comparing it to

Code:

sum age_resp if relation_head==1
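
To make the difference concrete, here is a small sketch on made-up data: the 3-member household described above plus a second, single-person household. The toy observations and the scalars mean_all and mean_heads are assumptions for illustration.

```stata
* Toy data (hypothetical): a 3-person household and a 1-person household
clear
input hhid relation_head age_resp
1 1 40
1 2 30
1 3 1
2 1 60
end

* Spread the head's age to every member of the household
bysort hhid: egen age_resp_hh = max(cond(relation_head == 1, age_resp, .))

* Mean over all observations: household 1 enters three times
summarize age_resp_hh
scalar mean_all = r(mean)

* Mean over heads only: each household enters once
summarize age_resp_hh if relation_head == 1
scalar mean_heads = r(mean)

display "mean over all observations = " mean_all
display "mean over heads only       = " mean_heads
```

Here the first mean is (40 + 40 + 40 + 60)/4 = 45 and the second is (40 + 60)/2 = 50. The mean over all observations weights each head by household size, so it can land above or below the true mean age of heads.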

At this point your problem is that you do not understand your data. To understand your data and your results, you must look at your data. We cannot do that for you.

Last edited by William Lisowski; 24 Jun 2018, 14:23.

Gatelik Tony

Join Date: Jun 2018
Posts: 8

#10

25 Jun 2018, 04:24

Thank you, sir, for your help. I am new to this forum, so I may make a few mistakes in how I post the data, despite having read FAQ #12.

This is my explanation regarding the data I posted

- #1 is a sample of the entire data. The only change is variable renaming, i.e. relation_head to relat_h, but the data are the same throughout.
- However, to remove any doubt, and since I need help with this, let me keep the variable names as in post #1.
- The data have 8,665 observations in total. What I had done in post #4 was to

Code:

 keep if relation_head==1

Code:

ta relation_head

   relationship to |
         household |      Freq.     Percent        Cum.
-------------------+-----------------------------------
              Head |      4,352       50.23       50.23
            Spouse |      2,752       31.76       81.98
      Son/Daughter |        820        9.46       91.45
     Father/Mother |        372        4.29       95.74
    Sister/Brother |        144        1.66       97.40
        Grandchild |         94        1.08       98.49
    Other relative |        113        1.30       99.79
Other non-relative |         18        0.21      100.00
-------------------+-----------------------------------
             Total |      8,665      100.00

- I have followed all your recommendations to the letter. For instance, when I replace the assert command with

Code:

list if head_count!=1

- I get the following error

Code:

 clear

. input double(uniqkey hhid gender_resp age_resp relation_head edu_resp)

        uniqkey        hhid  gender_r~p    age_resp  relation~d    edu_resp
  1. 23100715 1 0 80 1 2
  2. 23057547 1 0 31 2 3
  3. 23274081 1 1 44 1 2
  4. 23749048 1 1 55 1 4
  5. 23243503 1 0 24 1 3
  6. 23172349 1 1 43 1 6
  7. 23906022 1 1 45 1 3
  8. 22641796 1 0 30 2 1
  9. 22913024 1 0 25 2 3
 10. 23914304 1 0 34 1 1
 11. 22549839 1 1 34 1 1
 12. 23264638 1 1 46 1 4
 13. 22782580 1 0 40 2 3
 14. 23262017 1 1 61 1 2
 15. 23038080 1 1 20 1 3
 16. 22435662 1 0 37 1 5
 17. 23376548 1 0 66 2 2
 18. 23929260 1 0 81 1 1
 19. 22803692 1 1 22 4 1
 20. 23232463 1 0 35 2 2
 21. 22436483 1 1 21 4 5
 22. 23231202 1 0 50 2 2
 23. 23766268 2 1 40 1 1
 24. 23027864 2 1 23 1 6
 25. 23102007 2 1 66 1 3
 26. 23996659 2 0 37 2 5
 27. 23885477 2 0 17 4 4
 28. 23383315 2 1 37 1 5
 29. 23681554 2 0 40 1 3
 30. 23358174 2 0 29 1 4
 31. 23564467 2 0 60 2 3
 32. 23565050 2 1 25 3 5
 33. 22657283 2 0 24 2 1
 34. 23273920 2 0 30 2 4
 35. 22652368 2 1 21 3 5
 36. 23297018 2 0 33 2 3
 37. 22930350 3 1 36 1 3
 38. 22683962 3 0 41 1 3
 39. 23312313 3 0 44 1 2
 40. 23041366 3 0 16 5 4
 41. 23798868 3 0 46 2 3
 42. 22795777 3 0 32 2 3
 43. 22872566 3 0 35 2 2
 44. 23686088 3 0 29 2 5
 45. 23062772 3 0 16 4 2
 46. 23189013 3 0 50 1 4
 47. 22620672 3 0 33 2 3
 48. end

.
. tempfile survey

. save `survey'
file C:\Users\KDIS\AppData\Local\Temp\ST_01000001.tmp saved

.
. drop if relation_head!= 1
(24 observations deleted)

. list if relation_head!=1

. keep hhid *_resp

. rename (*_resp) (*_head)

. list, noobs

  +---------------------------------------+
  | hhid   gender~d   age_head   edu_head |
  |---------------------------------------|
  |    1          0         80          2 |
  |    1          1         44          2 |
  |    1          1         55          4 |
  |    1          0         24          3 |
  |    1          1         43          6 |
  |---------------------------------------|
  |    1          1         45          3 |
  |    1          0         34          1 |
  |    1          1         34          1 |
  |    1          1         46          4 |
  |    1          1         61          2 |
  |---------------------------------------|
  |    1          1         20          3 |
  |    1          0         37          5 |
  |    1          0         81          1 |
  |    2          1         40          1 |
  |    2          1         23          6 |
  |---------------------------------------|
  |    2          1         66          3 |
  |    2          1         37          5 |
  |    2          0         40          3 |
  |    2          0         29          4 |
  |    3          1         36          3 |
  |---------------------------------------|
  |    3          0         41          3 |
  |    3          0         44          2 |
  |    3          0         50          4 |
  +---------------------------------------+

. tempfile heads

. save `heads'
file C:\Users\KDIS\AppData\Local\Temp\ST_01000002.tmp saved

.
. use `survey', clear

. merge m:1 hhid using `heads'
variable hhid does not uniquely identify observations in the using data
r(459);

end of do-file

r(459);

- I highly appreciate your concern. After looking closely at the data, I have noticed that I get this error when I use hhid, particularly for household members other than the head. I tried using uniqkey, which is the serial number of the questionnaire, since I don't have any other identifier, and all the code works perfectly. However, I am also interested in deriving head characteristics for those who responded as related to the household head, as spouse, child, etc.

Is it possible to get head characteristics for the remaining 4,313 observations?

Code:


di  8665-4352
4313

Last edited by Gatelik Tony; 25 Jun 2018, 04:27.


William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

25 Jun 2018, 04:54

One thing we don't know is what the variables uniqkey and hhid in the data in post #1 actually identify. You have neglected to tell us, and we have made some assumptions that may be wrong. Thinking of datasets I've used, there was one in which the "household ID" was actually an identifier for the individual within that household. Some other variable actually contained a value that was different for each distinct household.

You have explained uniqkey. How does the documentation for the survey describe the variable hhid?
Gatelik Tony

Join Date: Jun 2018

Posts: 8
#12

25 Jun 2018, 04:58

hhid is the household identifier/ number according to the documentation.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

25 Jun 2018, 07:30

"household identifier" are two words. What do those words mean? What does the "household identifier" identify? Your documentation - perhaps not the codebook, perhaps other documentation for the survey - should say what its purpose is and how it is to be interpreted.

Apparently the uniqkey identifies each distinct questionnaire. That tells us nothing about how questionnaires correspond to households. Is there one questionnaire per household? Or one per individual?

This is the key to all the problems you have had. The "household identifier" does not identify households, as its name suggests it should. And we don't know if the uniqkey identifies households. You say your code is working perfectly using uniqkey instead of hhid, but what you write suggests it was run only on data for heads of households. The code Clyde provided in post #2 was designed to be run on the entire dataset of heads, spouses, children, etc.

If every individual in a household has the same uniqkey, then uniqkey, not hhid, functions as a "household identifier" that identifies members of the same household.

Your 48 sample observations do not show two individuals with the same uniqkey. Perhaps if you were to

Code:

sort uniqkey hhid
list uniqkey hhid relation_head in 1/100

it would be clear if uniqkey identifies households.
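
Beyond listing the first 100 observations, the same question can be checked over the whole dataset. The following is a sketch on made-up data in which uniqkey is shared within a household while hhid is not informative; the toy observations and the variables n_per_key and heads_per_key are assumptions for illustration.

```stata
* Toy data (hypothetical): uniqkey repeats within households, hhid does not separate them
clear
input double uniqkey hhid relation_head
101 1 1
101 1 2
102 1 1
103 1 1
103 1 3
end

* How many observations share each uniqkey?
bysort uniqkey: gen n_per_key = _N
tabulate n_per_key

* If uniqkey is unique per observation, it cannot group household members
capture isid uniqkey
display cond(_rc == 0, "uniqkey is unique per observation", "uniqkey repeats across observations")

* Heads per uniqkey: all 1 would be consistent with uniqkey identifying households
bysort uniqkey: egen heads_per_key = total(relation_head == 1)
tabulate heads_per_key
```

If n_per_key is 1 for every observation in the real data, uniqkey identifies questionnaires for single respondents, and some other variable is needed to group household members.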

If uniqkey does not identify households, you need to find some way of identifying which individuals are members of the same household. There is no avoiding this requirement. You have to know which non-heads are in the same household as the heads. Turn to the survey documentation for advice. My experience is that too many survey users do not take the time to thoroughly read the documentation and examine the data.

If uniqkey does indeed identify members of the same household, then if you apply Clyde's advice from post #2 to your 8665 observations, replacing "hhid" with "uniqkey", you should get what you want. The variables giving the household head's characteristics will exist on each observation having the same uniqkey.