Choose most common observation within group.

Sietze Hylkema

Join Date: Nov 2015

Posts: 4
#1

07 Jan 2016, 12:46

Hi!

I am working with a large dataset of households. I want to choose a reference person per household and apply their information to all observations.

So far I have created dummies for the selection variable, counted them, and dropped, per year, the household members with fewer total observations. This works for 95% of households, except when:
1. The selected person didn't fill in the survey in a specific year, but someone else from the household did.
2. Only 2 years are available, with different persons filling in the survey.

How can I solve those issues? For clarity, I want to keep all observations, but for certain variables I want to change the values to those of the reference person.

Something like: by(House-id): replace Birthyear = 'most common birthyear within household' if Count < largest Count within household

Year  House-id  Member  Birthyear  House income  Count
1996  21        1       1942       30            3
1997  21        1       1942       31            3
1998  21        3       1980       32            1
1999  21        1       1942       33            3

Thank you in advance!

Last edited by Sietze Hylkema; 07 Jan 2016, 12:50.
Nick Cox

Join Date: Mar 2014

Posts: 35709
#2

07 Jan 2016, 13:05

Classically the most common observation is called the mode, and it's not always well defined. But there is a dedicated egen function.

But when it's well defined it certainly occurs more than once, so that seems quite contradictory to the idea of selecting a single person.

Otherwise put, your question is not clear to me. Perhaps you can provide a worked example of which observation you want to select and why.
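The egen function referred to above is presumably mode(). As a minimal sketch, assuming the data from post #1 with hypothetical variable names (hhid, member, birthyear, income), the most common birth year within each household could be obtained like this:

```stata
clear
input year hhid member birthyear income
1996 21 1 1942 30
1997 21 1 1942 31
1998 21 3 1980 32
1999 21 1 1942 33
end

* Most common birth year within each household (variable names are assumed).
* If several values tie for the mode, mode() returns missing unless
* minmode, maxmode or nummode() is specified.
bysort hhid: egen mode_birthyear = mode(birthyear)

list, sepby(hhid)
```

This stores the modal value in a new variable and leaves the original birthyear untouched, so the two can be compared before anything is overwritten.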

Last edited by Nick Cox; 07 Jan 2016, 13:10.
Sietze Hylkema

Join Date: Nov 2015

Posts: 4
#3

07 Jan 2016, 14:00

I'll try to explain it better. I work with panel data of households. Per household I have income & wealth data. For my control variables I use the age & education level of the 'head of the household'.
However, this person sometimes changes. In the table in the first post, the head of the household in 1998 is suddenly the child, for an unknown reason. I want to correct this to the real head of the household. I created Count to identify how often a household member is the head of the household.

Now I am thinking of using something like: by(Household-id): replace Age = Age(n-1) if [[[ Count < largest Count within the household ]]]

How do I write an if condition that says the values of the person with the smaller Count should be replaced?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

07 Jan 2016, 18:58

Based on your sample code in post #3, I think what you want is something like the following, assuming hhid is your household ID variable.

Code:

bysort hhid (Count): replace Age=Age[_N]

This will order the observations by increasing Count within each household, and then within each household replace all the Age values with the value from the final observation in that household - which will be the one with the largest value of Count.

Having said that, you will have a problem if you have a household where there is a tie for the largest value of Count, because in that case the sort is not guaranteed to return the same individual each time. You need to think of a good tie-breaker to include, perhaps something as simple as the individual ID:

Code:

bysort hhid (Count individ): replace Age=Age[_N]
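
To illustrate the approach on the example household from post #1, here is a small self-contained sketch. The variable names (hhid, member, birthyear, income) are assumptions, with member standing in for the individual ID as tie-breaker. Writing the result to a new variable, ref_birthyear, instead of using replace is a choice made here so that the original values stay available for checking.

```stata
clear
input year hhid member birthyear income Count
1996 21 1 1942 30 3
1997 21 1 1942 31 3
1998 21 3 1980 32 1
1999 21 1 1942 33 3
end

* Within each household, sort by Count and then by member (tie-breaker),
* so the last observation belongs to the member with the largest Count.
bysort hhid (Count member): generate ref_birthyear = birthyear[_N]

* Restore the panel order for inspection.
sort hhid year
list, sepby(hhid)
```

Note that breaking ties by member ID simply picks the highest ID among the tied members; whether that is a sensible reference person depends on how IDs were assigned in the survey.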